SDM with Bayesian Statistics
Using BARTs to create a species richness map
1. Bayesian statistics
What is Bayesian statistics?
Bayesian statistics is seen as one of the three general approaches to statistics: Frequentist, Likelihood and Bayesian.
To understand what we will be doing, we must first understand what Bayesian statistics is. Bayesian statistics is based on Bayes' theorem. Bayes' theorem shows how one can calculate P(A|B) from knowing P(B|A). The formula is:
P(A|B) = (P(A)P(B|A)) / P(B)
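To make the formula concrete, here is a quick worked example in R. All numbers are made up purely for illustration: suppose a disease has prevalence P(A) = 0.01, a test detects it with P(B|A) = 0.95, and P(B) = 0.06 of all tests come back positive.

```r
# Bayes' theorem with illustrative (made-up) numbers
p_A   <- 0.01  # prior: prevalence of the disease
p_B_A <- 0.95  # likelihood: positive test given disease
p_B   <- 0.06  # evidence: overall rate of positive tests

p_A_B <- (p_A * p_B_A) / p_B  # posterior: disease given positive test
p_A_B
# ~0.158: a positive test raises the probability from 1% to about 16%
```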
Posterior and Prior
The general underlying concept of Bayesian statistics is that you create a "posterior" distribution from a "prior" distribution. The prior is updated to the posterior with new data; new evidence should update prior beliefs. Basically, every person can form a prior distribution for a problem. The posterior distribution is then formed from this prior distribution and our observed data, from which we can create a likelihood function.
posterior ∝ likelihood × prior
Sample size influences the effect of the prior on the posterior. The gif below shows a coin toss with a "wrong" prior. After every throw, the posterior is updated by the likelihood function. As the flips accumulate, our posterior starts to match the likelihood of the data. This shows that wrongly chosen priors aren't a problem if you throw enough data at them.
Another important point is that the posterior becomes the new prior. Therefore Bayesian analysis is ultimately "self-correcting".
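The coin-toss updating described above can be sketched in a few lines of base R with a Beta prior, the conjugate prior for a binomial likelihood (the numbers below are made up for illustration):

```r
# A "wrong" prior that strongly believes the coin is biased towards tails
a <- 2; b <- 10                      # Beta(2, 10), prior mean ~0.17

# Observed data: 100 flips of a roughly fair coin
heads <- 52; tails <- 48

# Conjugate update: posterior is Beta(a + heads, b + tails)
post_mean <- (a + heads) / (a + heads + b + tails)
post_mean
# ~0.48: the data pulls the posterior away from the wrong prior, and this
# posterior would serve as the prior for the next batch of flips
```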
Bayesian statistics is a complex but interesting field. More information: click here or here to learn about Bayes' theorem and Bayesian statistics. Click here if you want to know more about the concept of Bayesian statistics, especially the prior and posterior distributions.
Bayesian additive regression trees (BARTs)
In this tutorial we will work with BARTs. Let's take a quick look at the components:
Regression trees: these are explained in other tutorials. A regression tree is basically a decision tree used for regression, predicting continuous-valued outputs instead of discrete outputs. If you need a refresher on regression trees, check our other tutorials like random forest or click here.
Additive: this means that not a single tree is used, but a sum-of-trees model.
Bayesian: the way Bayesian statistics is incorporated into the additive regression trees is quite complex. Still, the general process is as explained earlier: priors are created, which together with a likelihood function yield posterior probabilities, which we then use to create our map.
There are priors for the tree structure, the leaf values and the number of trees. These priors each follow different rules. An example is the prior for the tree structure, which consists of:
1. The probability that a node at a certain depth is non-terminal.
2. Which covariate is included in the tree.
3. Once we choose a splitting variable, which value we use to make a decision.
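For the first of these rules, the default prior has a simple closed form: the probability that a node at depth d is non-terminal is α(1 + d)^(−β). The defaults α = 0.95 and β = 2 show up later in the bart output as "base" and "power". A small sketch:

```r
# Prior probability that a node at depth d splits further (is non-terminal);
# alpha = "base" and beta = "power" in the bart output (defaults 0.95 and 2)
p_nonterminal <- function(d, alpha = 0.95, beta = 2) alpha * (1 + d)^(-beta)

round(p_nonterminal(0:4), 3)
# deeper nodes are increasingly likely to stay terminal,
# which keeps the individual trees shallow
```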
BART models are fitted with the MCMC algorithm. This algorithm combines two probability techniques: Monte Carlo simulation and Markov chains. This is some complicated machine learning stuff, so we'll skip the details. Just know that the trees are fitted over and over again following the rules of this algorithm. That's one of the reasons why BARTs are gaining popularity: they are computationally intensive, and ever more powerful computers make them more easily applicable.
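As a flavour of the Monte Carlo half (a toy example, not how BART itself works): repeated random sampling approximates quantities that are hard to compute directly, e.g. estimating π from random points in the unit square:

```r
# Toy Monte Carlo: the fraction of random points in the unit square that
# fall inside the quarter circle approximates pi/4
set.seed(42)
n <- 100000
x <- runif(n); y <- runif(n)
pi_hat <- 4 * mean(x^2 + y^2 <= 1)
pi_hat
# close to 3.14; the estimate improves as n grows
```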
Why BARTs could be good for SDM
Multiple papers suggest that Bayesian approaches outperform non-Bayesian ones. Also, they don't require a lot of data to perform well and are flexible with new data. Moreover, a regularizing prior (shrinkage prior) helps keep the trees from overfitting, a problem which often occurs with other decision-tree methods.
BARTs and SDM
The method of BARTs is not yet widely implemented in SDM. Generally, Bayesian approaches are not yet used a lot in the field of SDM. I chose the route via BARTs and the package embarcadero. If you want to check out some other packages/ways which try to implement Bayesian statistics into SDM: click here for inlabru: Bayesian Latent Gaussian Modelling using INLA and Extensions. Click here for hSDM: Hierarchical Bayesian Species Distribution Models. Click here for jSDM: Joint Species Distribution Models.
2. Data Preparation
Let's start preparing our data for BARTs with embarcadero.
Setup
First, let's set our working directory.
= "C:\\Uni Marburg\\RStudio\\Species_dist_model\\bsc-species-distribution-modelling-2022-Gitxa\\project_tutorial"
wd setwd(wd)
getwd()
Then let's take a look at which libraries we'll need. We need the rnaturalearth package to download our shape of Pakistan. terra and sf are essential for working with spatial data. Since the packages raster and sp will no longer be supported from the end of 2023, it's quite nice we don't have to use them here. The embarcadero package provides us with many options for BARTs. embarcadero is a wrapper of dbarts and therefore has many similarities, but also some extras.
library(rnaturalearth)
library(terra)
library(sf)
library(embarcadero)
Load and prepare the data
First let's read in the observation data of butterflies in Pakistan.
butterflies_raw = read.csv(file = file.path(wd, "\\data\\distribution_data\\PakistanLadakh.csv"), sep = ",", dec = ".", fileEncoding="UTF-8-BOM")
Let's take a quick look at the data and add some names to the columns.
head(butterflies_raw)
## species x y
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6 Acraea_violae 72.34951 31.70947
names(butterflies_raw) = c("species", "x", "y") #add some names
Now we load the environmental data. Different predictors can be downloaded from WorldClim. In our case we used the already provided predictors: wind, maximum temperature, minimum temperature, average temperature, radiation and precipitation.
predictors = c(rast(file.path(wd, "data\\environmental_data\\vapr.tif")),
               rast(file.path(wd, "data\\environmental_data\\wind.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_max.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_min.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_avg.tif")),
               rast(file.path(wd, "data\\environmental_data\\srad.tif")),
               rast(file.path(wd, "data\\environmental_data\\prec.tif"))
)
#add names to predictors
names(predictors) <- c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")
Later on, we will need RasterStacks, so let's stack the predictors and take a quick look to see if everything worked.
pred_stacked = stack(predictors)
Later on, we will notice that fitting our model makes the computer work quite hard. Since creating a simple model for a single species already took an hour on my computer, it's unrealistic to attempt this for a couple hundred species, especially considering that the step model will need even more calculation. If we look at our pred_stacked we see it has 3112686 cells and 7 layers. If we lower the resolution a bit, there are fewer cells for the computer to work with, which speeds up the process. To do this we can use the aggregate function from the terra package. This way, we create a new SpatRaster (in this case a RasterStack) with a lower resolution. This means we have larger cells and our cell count drops. For my computer the best workable resolution was going from around 0.008 x 0.008 to 0.04 x 0.04. This way the final loop still took around 7 hours, but the calculations for a single species run quite fast.
pred_stacked #see resolution and CRS
## class : RasterStack
## dimensions : 1602, 1943, 3112686, 7 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 60.85, 77.04167, 23.7, 37.05 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## names : vapr, wind, tmax, tmin, t_mean, srad, precip
## min values : ?, ?, ?, ?, ?, 0, -32768
## max values : ?, ?, ?, ?, ?, 65535, 32767
pred_stack_aggregate = aggregate(pred_stacked, fact=5)
res(pred_stack_aggregate) #check if it worked
## [1] 0.04166667 0.04166667
We also see that pred_stacked has the CRS WGS84, so we'll add this to the rest of our data in a second.
crs = CRS("+init=epsg:4326")
Now we load the shape of Pakistan to complete the data we need.
shape_Pakistan = ne_countries(scale = 50, type = "countries", country = "Pakistan", returnclass = "sf")
shape_Pakistan = st_transform(shape_Pakistan, st_crs(crs)) #crs to shape
Mask data
If we take a look at our observation data, we see there are some observation points outside of Pakistan, so let's get rid of those.
butterflies_beforecutout = SpatialPointsDataFrame(cbind(butterflies_raw$x, butterflies_raw$y), butterflies_raw, proj4string = crs)
plot(pred_stacked$vapr, main="All observation data", xlab="Longitude", ylab="Latitude")
plot(butterflies_beforecutout, col = "blue", add = T, cex = 0.3, pch = 16)
To be able to cut out the points, we convert the observation data to an sf object and then convert it back. This was first done in random forest III, I just stole it :D
#to cut out points: sf object
sf_butterfly = st_as_sf(butterflies_raw, coords = c("x", "y"), crs = crs)

# cutting out the points outside of Pakistan
cutout_butterfly = st_intersection(sf_butterfly, shape_Pakistan)

# returning the sf object to a data frame
cutout_butterfly_df = as.data.frame(cutout_butterfly)

# getting the coordinates for the data frame as the command "xy = TRUE" in as.data.frame() didn't work here
coordinates_butterfly = sf::st_coordinates(cutout_butterfly)

# transform the matrix with the coordinates to a data frame
coords_df = as.data.frame(coordinates_butterfly)
butterfly = cbind(cutout_butterfly_df$species, coords_df) # combine the cut out observation points with the coordinates

#add names again
names(butterfly) = c("species", "x", "y")
To plot the cutout data we could either create a SpatialPointsDataFrame again, or use another method with ggplot.
#points after cutout
ggplot() + geom_sf(fill = "white", data = shape_Pakistan$geometry) + geom_point(data = butterfly, size = 1, colour = "blue", aes(x = x, y = y)) +
labs(x = "Longitude", y = "Latitude", title = "Observation Points of Butterflies")
Now that we gathered all the data, we can start modelling. The model creation will be exemplified by one species. In this case I chose Aglais_caschmirensis (mainly because it was the first species with many observation points).
caschmirensis = butterfly[butterfly$species == "Aglais_caschmirensis", 2:3]
Presence data
First, let's extract the predictor values for our observation data.
#extract data from predictors for species data
ext_casch = terra::extract(predictors, caschmirensis)
#see if it worked
head(ext_casch)
## ID vapr wind tmax tmin t_mean srad precip
## 1 1 0.49 1.4 10.6 -1.6 4.5 10234 37
## 2 2 0.08 1.8 -12.1 -18.8 -15.4 6923 60
## 3 3 0.20 1.5 2.9 -10.7 -3.9 7510 29
## 4 4 0.21 1.7 0.9 -12.1 -5.6 7715 26
## 5 5 0.16 2.1 -3.3 -15.0 -9.1 7607 28
## 6 6 0.06 2.2 -14.2 -19.2 -16.7 6755 62
We want to get rid of the ID column.
#get rid of ID column
ext_casch = subset(ext_casch, select = -ID)
#let's see if that worked
head(ext_casch)
## vapr wind tmax tmin t_mean srad precip
## 1 0.49 1.4 10.6 -1.6 4.5 10234 37
## 2 0.08 1.8 -12.1 -18.8 -15.4 6923 60
## 3 0.20 1.5 2.9 -10.7 -3.9 7510 29
## 4 0.21 1.7 0.9 -12.1 -5.6 7715 26
## 5 0.16 2.1 -3.3 -15.0 -9.1 7607 28
## 6 0.06 2.2 -14.2 -19.2 -16.7 6755 62
#worked out
Pseudo-absence data
Like most other SDM methods in R, the BART model itself is run on a data frame of presence–absence or presence–pseudoabsence points and associated environmental covariates. So we first create points all over Pakistan with randomPoints and then extract the data from our predictors. Carlson strongly suggests creating an equal amount of pseudo-absence and presence data for the best results.
#set seed for reproducible results
set.seed(3456)
#create random points
absence_b = randomPoints(pred_stack_aggregate, nrow(ext_casch))
#extract data at those coordinates
abs.but = raster::extract(pred_stack_aggregate, absence_b)
#add Observed column to presence data
ext_casch = data.frame(ext_casch); ext_casch$Observed <- 1
#add Observed column to absence data
abs.but = data.frame(abs.but); abs.but$Observed <- 0
#bind them
all.b = rbind(ext_casch, abs.but)
#take a look
head(all.b)
## vapr wind tmax tmin t_mean srad precip Observed
## 1 0.49 1.4 10.6 -1.6 4.5 10234 37 1
## 2 0.08 1.8 -12.1 -18.8 -15.4 6923 60 1
## 3 0.20 1.5 2.9 -10.7 -3.9 7510 29 1
## 4 0.21 1.7 0.9 -12.1 -5.6 7715 26 1
## 5 0.16 2.1 -3.3 -15.0 -9.1 7607 28 1
## 6 0.06 2.2 -14.2 -19.2 -16.7 6755 62 1
Check the presence-absence data
Just to be sure, let's plot the values to check.
plot(all.b$vapr, col =ifelse( all.b$Observed == "0" ,"red", "black"))
So that worked.
Bind coordinates to be able to plot spatially.
abs_pres = rbind(caschmirensis, absence_b)
Create a spdf and plot it spatially.
#one could again do this with an sf object if one wanted to
spdf_abs_pres = SpatialPointsDataFrame(abs_pres, all.b, proj4string = crs)
#plot the locations
plot(pred_stacked$vapr, main="Absence (red) and presence (black) data", xlab="Longitude", ylab="Latitude")
plot(spdf_abs_pres, col = ifelse(spdf_abs_pres$Observed == "0", "red", "black"), add = T, cex = 0.3, pch = 16)
#looks great
3. Model the data
Simple models
As a first step we define the names of the predictor variables we want to train on.
test_names = c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")
Now we train our first model! We do this with the bart function of embarcadero.
set.seed(4567)
#create model
sdm1 = bart(y.train=all.b[,"Observed"], #dependent variable
            x.train=all.b[,test_names], #explanatory variables
            keeptrees = TRUE) # It's very important this is set to TRUE, else the predict function cannot run
##
## Running BART with binary y
##
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
## k prior fixed to 2.000000
## power and base for tree prior: 2.000000 0.950000
## use quantiles for rule cut points: false
## proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
## number of training observations: 106
## number of test observations: 0
## number of explanatory variables: 7
##
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
## (6: 100) (7: 100)
## offsets:
## reg : 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.739198
##
## Tree sizes, last iteration:
## [1] 2 3 3 3 3 2 3 2 2 2 2 2 2 2 3 2 2 2
## 2 3 2 2 2 2 2 3 1 2 2 3 2 2 2 2 1 2 4 2
## 2 5 4 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2 4 2
## 2 2 1 2 2 3 2 2 2 3 2 2 2 2 2 2 1 2 3 2
## 4 2 2 2 2 3 3 3 3 2 2 2 4 2 2 3 2 2 1 3
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1
## 2 2 2 3 3 2 2 1 3 2 3 2 2 3 2 2 2 3 2 2
## 2 3 3 1 3 2 2 2 3 2 2 2 2 3 2 2 2 3 1 2
## 2 2 2 3 3 2 2 1 3 2 1 2 3 2 2 2 2 2 3 2
## 4 3 2 1 2 2 3 2 1 1 2 3 2 2 2 2 2 2 4 2
## 2 2
##
## Variable Usage, last iteration (var:count):
## (1: 30) (2: 44) (3: 40) (4: 35) (5: 34)
## (6: 32) (7: 28)
## DONE BART
#one could add splitby = 10, minimizing memory conflicts and adding a progress bar
#predict map
map1 = predict(sdm1, pred_stack_aggregate, quiet=TRUE)
Nice, this already gives us some info about the number of trees, the prior, the MCMC loop and even the variable usage.
How’s it look?
plot(map1, main='Predicted probability of Aglais caschmirensis',
box=F, axes=F)
Model analysis
To check how good our model is, we use the summary function. In this case, it provides us with loads of information.
summary(sdm1)
## Call: bart all.b[, test_names] all.b[, "Observed"] TRUE
##
## Predictor list:
## vapr wind tmax tmin t_mean srad precip
##
## Area under the receiver-operator curve
## AUC = 0.9654681
##
## Recommended threshold (maximizes true skill statistic)
## Cutoff = 0.402945
## TSS = 0.8490566
## Resulting type I error rate: 0
## Resulting type II error rate: 0.1509434
In the embarcadero package the summary function provides us with a graphical output and some printed information. Especially interesting are the AUC value, the classified fitted values and the recommended threshold. A high area under the receiver-operator curve and a clear visual split between the predicted probabilities assigned to the training presences and pseudo-absences indicate that the model has done an adequate job. The summary function also returns an optimal threshold that maximizes the true skill statistic (TSS), and the sensitivity/specificity of the model at that cutoff (alpha).
In our case the model already performed quite well. Our AUC is 0.965. We'll see if we can make it perform even better in a second. Also, there is a clear visual split in the classified values.
We can also create a threshold model. The summary function above provided us with the best cutoff, so we'll use that one.
plot(map1>0.4, main='Predicted distribution',
box=F, axes=F)
Variable reduction
Now let's take a look at the variable importance. We can plot the relative contribution of each variable depending on the number of trees.
set.seed(6789)
varimp.diag(all.b[,test_names], all.b[,'Observed'], iter=5, quiet=TRUE)
##
## 10 tree models: 5 iterations
##
## 20 tree models: 5 iterations
##
## 50 tree models: 5 iterations
##
## 100 tree models: 5 iterations
##
## 150 tree models: 5 iterations
##
## 200 tree models: 5 iterations
In general, as we increase the number of trees, the distribution of the relative importance tends to become flatter. With more trees, we demand less predictive power from each tree, which means less relevant features have a higher chance of being part of a given tree. Conversely, if we decrease the number of trees, we demand more from each single tree, which induces a more stringent competition between variables to be part of the trees; as a consequence, only the really important variables will be included in the final trees. This is why later in our final loop we will use only 5 trees for our bart.step function. This also has a nice side effect: the computer can handle the calculation.
In these plots one sees that wind and tmin have the lowest variable importance.
There's also a tool for stepwise variable reduction called variable.step. When applied, the following is done:
1. Fit a full model with all predictors and a small tree ensemble (default m = 10), a fixed number of times (default n = 50);
2. Eliminate the least informative variable across all 50 runs;
3. Rerun the models minus the least informative variable (n = 50 times again), recording the root mean square error (RMSE; on the training data);
4. Repeat steps 2 and 3 until only three covariates are left;
5. Finally, select the model with the lowest average RMSE.
set.seed(4567)
step.model = variable.step(x.data = all.b[,test_names],
                           y.data = all.b[,'Observed'])
Our root mean square error was lowest when 3 variables were dropped, so we will drop wind, tmin and t_mean. When running this you can watch the variables get reduced step by step.
So we retrain the model with step.model.
set.seed(4567)
sdm2 = bart(x.train=all.b[, step.model], y.train=all.b[,'Observed'],
            keeptrees = TRUE)
##
## Running BART with binary y
##
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
## k prior fixed to 2.000000
## power and base for tree prior: 2.000000 0.950000
## use quantiles for rule cut points: false
## proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
## number of training observations: 106
## number of test observations: 0
## number of explanatory variables: 4
##
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100)
## offsets:
## reg : 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.778987
##
## Tree sizes, last iteration:
## [1] 2 2 2 1 3 2 2 3 2 2 2 2 3 2 3 2 3 3
## 2 3 2 2 3 3 3 3 2 1 3 2 1 2 2 2 2 2 2 2
## 2 2 2 3 3 3 1 1 2 1 2 2 2 3 3 4 2 2 2 2
## 3 2 3 1 4 2 2 2 2 2 3 2 2 3 2 2 2 1 2 2
## 2 2 2 3 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3
## 2 2 1 3 3 3 2 3 2 3 2 2 3 3 2 3 2 4 2 2
## 3 3 2 3 3 3 2 3 3 3 2 2 2 4 2 3 2 2 3 1
## 3 2 4 2 3 2 3 3 2 3 2 2 1 1 2 3 2 3 2 2
## 2 2 2 2 2 3 2 2 2 1 3 3 2 2 2 2 3 2 2 2
## 3 3 2 2 3 2 2 2 2 1 2 2 2 2 4 2 2 2 3 2
## 2 4
##
## Variable Usage, last iteration (var:count):
## (1: 46) (2: 75) (3: 71) (4: 67)
## DONE BART
And predict the species distribution again.
set.seed(4567)
map2 = predict(sdm2, pred_stack_aggregate, quiet=TRUE)
We can now compare the models.
summary(sdm2)
## Call: bart all.b[, step.model] all.b[, "Observed"] TRUE
##
## Predictor list:
## vapr tmax srad precip
##
## Area under the receiver-operator curve
## AUC = 0.9661801
##
## Recommended threshold (maximizes true skill statistic)
## Cutoff = 0.3696936
## TSS = 0.8490566
## Resulting type I error rate: 0
## Resulting type II error rate: 0.1509434
We see that now only four predictors were used. We can also see other small changes. The AUC is now 0.966, just a bit higher than before.
Let's compare the threshold maps.
def_par = par() #allows us to set the par back to default afterwards, else with dev.off()
par(mfrow=c(1,2))
plot(map1>0.4, main='Predicted distribution before',
box=F, axes=F)
plot(map2>0.37, main='Predicted distribution now',
box=F, axes=F)
We do see small differences.
The stepwise variable reduction which we performed in multiple steps can also be done all at once with bart.step. The function does the following:
1. The variable importance diagnostic from varimp.diag is generated.
2. The stepwise variable set reduction in variable.step is run.
3. The final model is run with the reduced predictor set.
4. The variable importance plot from varimp is generated.
5. The model summary from summary is returned.
So when creating our species richness map, we'll be using this function.
bartstep = bart.step(y.data = all.b[,'Observed'],
                     x.data = all.b[,test_names])
The package embarcadero has one last very nice feature. With the function spartial we can project partial dependence surfaces into real geographic space, so we can see the spatial signal of any given variable's contribution to the model predictions.
set.seed(4567)
plot(spartial(sdm2, pred_stack_aggregate, x.vars='precip',
equal=TRUE))#, smooth=5)
4. Looping
Prepare Data
To create our desired species richness map, we somehow need to repeat the process shown for Aglais caschmirensis for all of our species. Since there are species with only one or two observations, this would not be accurate. Since BARTs perform quite well with only little information, we here take all species with at least five observation records. With the levels function we get our species names.
butterfly$species = as.factor(butterfly$species)
species = levels(butterfly$species)
species_5 = "Aglais_caschmirensis" #create so we have a base to add to
Then we use a loop (which I again snatched from Leon's tutorial of Random Forest III :D) that checks how many observation points a species has and, if there are at least five, adds it to species_5.
for (i in species) {
  sum_butterfly = sum(butterfly$species == i)
  if(sum_butterfly > 4){ species_5 <- append(species_5, subset(species, species == i))}
}
Now we remove Aglais caschmirensis from species_5 because we already added it.
species_5 = species_5[-1]
Our final loop
Now we can finally run our loop. As explained earlier, we fit our model with the bart function. To select the most important variables and use them, we could create our step model with variable.step and then use it to create a new sdm. In this case it's easier to use the automated bart.step function. To create the loop we will use lists to store our results. This way we can also check in hindsight whether all steps worked the way we wanted them to. Since our final goal is to create a threshold model, we will use the data created by the bart.step function with the threshold 0.5. Unfortunately I could not find a way to include the threshold proposed by the summary function in the loop. There are ways to convert the output data into a usable form and include it in the loop; however, I could not find one for our case, because the summary function gives us an unusual output (this can be seen by applying the attributes function to it). This unfortunately prevents us from using the threshold or AUC data in the loop. These can still be looked at for single species, as I will show later.
If you can't wait too long, the normal model should be enough; if you have a couple of hours to spare, use the step model. Creating lists such as list_extr_all is useful if you want to check afterwards whether everything worked.
#create lists
list_extr_all = list()
list_sdms = list()
list_sdms_step = list()
list_maps = list()
list_maps_step = list()
list_maps_treshold = list()
#loop
for (i in species_5) {
  #extract coords from data
  coords_but_all = butterfly[butterfly$species == i, 2:3]

  #use these coords to extract the data of predictors
  ext_pres_all = terra::extract(predictors, coords_but_all)

  #get rid of ID column
  obs_pres_all = subset(ext_pres_all, select = -ID)

  #add Observed column to presence data
  obs_pres_all = data.frame(obs_pres_all); obs_pres_all$Observed <- 1

  #now create the absence data
  set.seed(4567)
  absence_b_all = randomPoints(pred_stack_aggregate, nrow(ext_pres_all))
  abs_but_all = terra::extract(pred_stack_aggregate, absence_b_all)

  #add Observed column to absence data
  abs_but_all = data.frame(abs_but_all); abs_but_all$Observed <- 0

  #bind so we can create our model from it
  all_b_extr = rbind(obs_pres_all, abs_but_all)

  #add bound data to list
  list_extr_all[[i]] = all_b_extr

  #create names for test
  test_names = c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")

  set.seed(4567)
  #create model so we can later compare the normal model to the step model; this also runs a lot faster than the step model
  sdm_all = bart(y.train=all_b_extr[,"Observed"],
                 x.train=all_b_extr[,test_names],
                 keeptrees = TRUE, # It's very important this is set to TRUE
                 ntree = 5,
                 nskip = 0)

  #create step model, this takes a couple of hours
  set.seed(4567)
  sdm_all_step = bart.step(y.data = all_b_extr[,'Observed'],
                           x.data = all_b_extr[,test_names],
                           tree.step = 5)

  #add step model to list, useful in case you want to check the summary later on
  list_sdms_step[[i]] = sdm_all_step

  #add model to list
  list_sdms[[i]] = sdm_all

  #predict maps from model and predictors
  set.seed(4567)
  map_all_step = predict(sdm_all_step, pred_stack_aggregate, splitby=10)
  map_all = predict(sdm_all, pred_stack_aggregate, splitby=10)

  #add map to list
  list_maps_step[[i]] = map_all_step

  #create threshold maps from maps with step model
  treshold_map = map_all_step > 0.5

  #add them to list
  list_maps_treshold[[i]] = treshold_map

  #check what species we are at
  print(paste("Element", i, "has been processed."))
}
We now stack our maps and add names.
#stack the maps
maps_stacked_step_tresh = stack(list_maps_treshold)
#add names
names(maps_stacked_step_tresh) = c(species_5)
If one wanted, they could now check out a separate map for every species, as a threshold map or a probability map.
#e.g. our species: caschmirensis
par(mfrow=c(1,2))
plot(list_maps_step$Aglais_caschmirensis, xlab = "Longitude", ylab = "Latitude", main="probability map")
plot(maps_stacked_step_tresh$Aglais_caschmirensis, xlab = "Longitude", main="threshold map")
Also, we could look at the AUC or threshold data for every species using our list of sdms.
summary(list_sdms_step$Aglais_caschmirensis)
## Call: bart train[, 2:ncol(train)] train[, 1] n.trees TRUE
##
## Predictor list:
## vapr tmax srad precip
##
## Area under the receiver-operator curve
## AUC = 0.9658241
##
## Recommended threshold (maximizes true skill statistic)
## Cutoff = 0.3693926
## TSS = 0.8490566
## Resulting type I error rate: 0
## Resulting type II error rate: 0.1509434
We could then use the given cutoff of 0.37 to create an even better performing threshold map for each species.
Our species richness map
We want to add all the species to a richness map though. So here is our final result.
richness_map_step_tresh = stackApply(maps_stacked_step_tresh, indices=c(1), fun=sum, na.rm=TRUE)
par(def_par)
plot(richness_map_step_tresh, xlab = "Longitude", ylab = "Latitude",main="Species Richness Map")
Looks good. For a possibly more accurate result, more environmental data could be added (elevation, landcover…).
5. Sources
Bayesian Additive Regression Trees — Bayesian Modeling and
Computation in Python (2022). Online available under https://bayesiancomputationbook.com/markdown/chp_07.html.
Last checked 04.07.2022.
Carlson, C.J. (2020):
embarcadero: Species distribution modelling with Bayesian additive
regression trees in r. In: Methods Ecol Evol 11 (7), S. 850–858. Last
checked 04.07.2022.
Carlson, C.J. (2020):
cjcarlson/pier39: ME&E Paper release - advanced vignette: Zenodo.
Last checked 04.07.2022.
Jost, Z. (2017): “Bayesian
Additive Regression Trees” paper summary - Towards Data Science. In:
Towards Data Science, 07.03.2017. Online available under https://towardsdatascience.com/bayesian-additive-regression-trees-paper-summary-9da19708fa71,
last checked 04.07.2022.
Mehta, S. (2022): A beginner’s
guide to Bayesian Additive Regression Trees. In: Analytics India
Magazine, 27.03.2022. Online available under https://analyticsindiamag.com/a-beginners-guide-to-bayesian-additive-regression-trees/,
last checked 04.07.2022.
Session Info
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] embarcadero_1.2.0.1003 data.table_1.14.2 matrixStats_0.62.0
## [4] ggpubr_0.4.0 patchwork_1.1.1 ROCR_1.0-11
## [7] dismo_1.3-5 forcats_0.5.1 stringr_1.4.0
## [10] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2
## [13] tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
## [16] tidyverse_1.3.1 Metrics_0.1.4 dbarts_0.9-22
## [19] raster_3.5-15 sp_1.5-0 sf_1.0-7
## [22] terra_1.5-21 rnaturalearth_0.1.0 rmdformats_1.0.4
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.2 lubridate_1.8.0 httr_1.4.3
## [4] tools_4.2.0 backports_1.4.1 bslib_0.3.1
## [7] rgdal_1.5-32 utf8_1.2.2 R6_2.5.1
## [10] KernSmooth_2.23-20 DBI_1.1.3 colorspace_2.0-3
## [13] withr_2.5.0 rnaturalearthdata_0.1.0 tidyselect_1.1.2
## [16] compiler_4.2.0 cli_3.3.0 rvest_1.0.2
## [19] xml2_1.3.3 labeling_0.4.2 bookdown_0.27
## [22] sass_0.4.1 scales_1.2.0 classInt_0.4-7
## [25] proxy_0.4-27 digest_0.6.29 rmarkdown_2.14
## [28] pkgconfig_2.0.3 htmltools_0.5.2 highr_0.9
## [31] dbplyr_2.2.0 fastmap_1.1.0 rlang_1.0.2
## [34] readxl_1.4.0 rstudioapi_0.13 farver_2.1.0
## [37] jquerylib_0.1.4 generics_0.1.2 jsonlite_1.8.0
## [40] car_3.1-0 magrittr_2.0.3 s2_1.0.7
## [43] Rcpp_1.0.8.3 munsell_0.5.0 fansi_1.0.3
## [46] abind_1.4-5 lifecycle_1.0.1 stringi_1.7.6
## [49] yaml_2.3.5 carData_3.0-5 plyr_1.8.7
## [52] grid_4.2.0 parallel_4.2.0 crayon_1.5.1
## [55] lattice_0.20-45 haven_2.5.0 hms_1.1.1
## [58] knitr_1.39 pillar_1.7.0 ggsignif_0.6.3
## [61] codetools_0.2-18 wk_0.6.0 reprex_2.0.1
## [64] glue_1.6.2 evaluate_0.15 modelr_0.1.8
## [67] vctrs_0.4.1 tzdb_0.3.0 cellranger_1.1.0
## [70] gtable_0.3.0 reshape_0.8.9 assertthat_0.2.1
## [73] xfun_0.31 broom_0.8.0 e1071_1.7-11
## [76] rstatix_0.7.0 class_7.3-20 units_0.8-0
## [79] ellipsis_0.3.2