SDM with Bayesian statistics

Using BARTs to create a species richness map

1. Bayesian statistics

What is Bayesian statistics?

Bayesian statistics is seen as one of the three general approaches to statistics: there is the frequentist, the likelihood-based and the Bayesian approach.

To understand what we will be doing, we must first understand what Bayesian statistics is. Bayesian statistics is based on Bayes' theorem, which shows how one can calculate P(A|B) from knowing P(B|A). The formula is:

P(A|B) = (P(A)P(B|A)) / P(B)
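
To make this concrete, here is a tiny worked example in R; the numbers are invented purely for illustration. Suppose a disease affects 1% of people (the prior P(A)) and a test detects it 95% of the time, with a 5% false-positive rate:

#hypothetical numbers, chosen only to illustrate Bayes' theorem
p_A      = 0.01 #P(A): prior probability of having the disease
p_B_A    = 0.95 #P(B|A): probability of a positive test given the disease
p_B_notA = 0.05 #P(B|notA): probability of a positive test without the disease

#P(B): total probability of a positive test
p_B = p_B_A * p_A + p_B_notA * (1 - p_A)

#Bayes' theorem: P(A|B)
p_A * p_B_A / p_B
## [1] 0.1610169

So even after a positive test, the probability of disease is only about 16%, because the prior probability was so low.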

Posterior and Prior

The general underlying concept of Bayesian statistics is that you create a "posterior" distribution from a "prior" distribution. The prior is updated to the posterior with new data; new evidence should update prior beliefs. Basically, every person can form a prior distribution for a problem. The posterior distribution is then formed by this prior distribution and our observed data, from which we can create a likelihood function.

posterior ∝ likelihood × prior

Sample size influences the effect of the prior on the posterior. The gif below shows a coin toss with a "wrong" prior. After every throw, the posterior is updated by the likelihood function. As the number of flips increases, our posterior starts to match the data. This shows that a wrongly chosen prior isn't a problem if you throw enough data at it.
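
The same idea can be reproduced in a few lines of R. This is a minimal sketch assuming a Beta prior on the probability of heads (the Beta distribution is conjugate to the binomial, so updating the prior just means adding the observed counts):

set.seed(42)
true_p = 0.5    #the coin is actually fair
a0 = 10; b0 = 2 #a deliberately "wrong" prior: Beta(10, 2) expects heads about 83% of the time

for (n in c(0, 10, 100, 1000)) {
  flips = rbinom(n, 1, true_p)
  a = a0 + sum(flips)     #posterior is Beta(a0 + heads, b0 + tails)
  b = b0 + n - sum(flips)
  print(sprintf("flips = %4d, posterior mean = %.3f", n, a / (a + b)))
}

With enough flips the posterior mean approaches 0.5 despite the wrong prior.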

Another important point is that the posterior will become the new prior. Therefore, Bayesian analysis is ultimately "self-correcting".

Bayesian statistics is a complex but interesting field. More information: click here or here to learn about Bayes' theorem and Bayesian statistics. Click here if you want to know more about the concepts of Bayesian statistics, especially the prior and posterior distributions.



Bayesian additive regression trees (BARTs)

In this tutorial we will work with BARTs. Let's take a quick look at the components:

Regression trees: these are explained in other tutorials. A regression tree is basically a decision tree used for regression, i.e. for predicting continuous outputs instead of discrete classes. If you need a refresher on regression trees, check our other tutorials like random forest or click here.

Additive: This means that not just one tree is used, but a sum-of-trees model.

Bayesian: The way Bayesian statistics is incorporated into the additive regression trees is quite complex. Still, the general process is as explained earlier: priors are created, which together with a likelihood function yield posterior probabilities, which we then use to create our map.
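
To put the "additive" part in symbols (this follows the original BART formulation by Chipman et al.):

y = g(x; T_1, M_1) + g(x; T_2, M_2) + … + g(x; T_m, M_m) + ε,   with ε ~ N(0, σ²)

Each g(x; T_j, M_j) is a single regression tree with tree structure T_j and leaf values M_j, and m is the number of trees (200 by default, as we will see in the model output later). The priors described next are placed on exactly these components.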

There are priors for the tree structure, the leaf values and the number of trees, each built on different rules. An example is the prior for the tree structure, which consists of:
1. The probability that a node at a certain depth is non-terminal (see the sketch after this list).
2. The probability of which covariate is included in the tree.
3. Once a splitting variable is chosen, the probability of which value is used to make the split.
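
For point 1, the standard BART prior (and the dbarts default that embarcadero inherits) sets the probability that a node at depth d is non-terminal to base × (1 + d)^(−power), with base = 0.95 and power = 2. These two numbers are echoed later in the model output ("power and base for tree prior"). A quick sketch of what that implies:

base = 0.95   #dbarts default
power = 2     #dbarts default

depth = 0:4
data.frame(depth, p_nonterminal = round(base * (1 + depth)^(-power), 3))
##   depth p_nonterminal
## 1     0         0.950
## 2     1         0.238
## 3     2         0.106
## 4     3         0.059
## 5     4         0.038

Splitting quickly becomes unlikely with depth, which keeps the individual trees shallow.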

BART models are fitted with an MCMC algorithm, which combines two probabilistic techniques: Monte Carlo simulation and Markov chains. This is some complicated machine learning stuff, so we'll skip the details; just know that the trees are fitted over and over again according to the rules of this algorithm. That's also one of the reasons why BARTs are getting more popular: they are computationally intensive, and ever more powerful computers make them increasingly practical.

Why BARTs could be good for SDM

Multiple papers suggest that Bayesian approaches outperform non-Bayesian ones. Also, they don't require a lot of data to perform well and are flexible when new data arrive. Moreover, a regularizing (shrinkage) prior helps keep the trees from overfitting, a problem which often occurs with other decision-tree methods.

BARTs and SDM

BARTs are not yet widely implemented for SDM; in general, Bayesian approaches are not yet used much in the field. I chose the route via BARTs and the package embarcadero. If you want to check out some other packages which bring Bayesian statistics to SDM: Click here for inlabru (Bayesian latent Gaussian modelling using INLA and extensions). Click here for hSDM (hierarchical Bayesian species distribution models). Click here for jSDM (joint species distribution models).



2. Data Preparation

Let's start preparing our data for BARTs with embarcadero.

Setup

First, let's set our working directory.

wd = "C:\\Uni Marburg\\RStudio\\Species_dist_model\\bsc-species-distribution-modelling-2022-Gitxa\\project_tutorial"
setwd(wd)
getwd()

Then let's take a look at the libraries we'll need. We need the rnaturalearth package to download our shape of Pakistan. terra and sf are essential for working with spatial data. Since the packages raster and sp will no longer be supported from the end of 2023, it's quite nice we don't have to rely on them directly here. The embarcadero package provides us with many options for BARTs; it is a wrapper around dbarts and therefore has many similarities, but also some extras.

library(rnaturalearth)
library(terra)
library(sf)
library(embarcadero)



Load and prepare the data

First, let's read in the observation data of butterflies in Pakistan.

butterflies_raw = read.csv(file = file.path(wd, "data", "distribution_data", "PakistanLadakh.csv"), sep = ",", dec = ".", fileEncoding = "UTF-8-BOM")

Let's take a quick look at the data and add some names to the columns.

head(butterflies_raw)
##          species        x        y
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6  Acraea_violae 72.34951 31.70947
names(butterflies_raw) = c("species", "x", "y") #add some names

Now we load the environmental data. Different predictors can be downloaded from WorldClim. In our case we used the predictors already provided: water vapour pressure, wind, maximum temperature, minimum temperature, average temperature, solar radiation and precipitation.

predictors = c(rast(file.path(wd, "data\\environmental_data\\vapr.tif")),
               rast(file.path(wd, "data\\environmental_data\\wind.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_max.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_min.tif")),
               rast(file.path(wd, "data\\environmental_data\\temp_avg.tif")),
               rast(file.path(wd, "data\\environmental_data\\srad.tif")),
               rast(file.path(wd, "data\\environmental_data\\prec.tif"))
)

#add names to predictors
names(predictors) <- c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")

Later on we will need RasterStacks, so let's stack the predictors and take a quick look to see if everything worked.

pred_stacked = stack(predictors)

Later on, we will notice that fitting our model makes the computer work quite hard. Since creating a simple model for a single species already took an hour on my computer, it's unrealistic to attempt this for a couple hundred species, especially since the step model will need even more computation. If we look at pred_stacked, we see it has 3112686 cells and 7 layers. If we lower the resolution a bit, the computer has fewer cells to work with, which speeds up the process. To do this we can use the aggregate function from the terra package. This way, we create a new raster stack with a lower resolution: the cells become larger and the number of cells decreases. For my computer the best workable compromise was going from a resolution of around 0.008 x 0.008 to 0.04 x 0.04. This way the final loop still took around 7 hours, but the calculations for a single species run quite fast.

pred_stacked #see resolution and CRS
## class      : RasterStack 
## dimensions : 1602, 1943, 3112686, 7  (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 60.85, 77.04167, 23.7, 37.05  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## names      :  vapr,  wind,  tmax,  tmin, t_mean,   srad, precip 
## min values :     ?,     ?,     ?,     ?,      ?,      0, -32768 
## max values :     ?,     ?,     ?,     ?,      ?,  65535,  32767
pred_stack_aggregate = aggregate(pred_stacked, fact=5)
res(pred_stack_aggregate) #check if it worked
## [1] 0.04166667 0.04166667

We also see that pred_stacked has the CRS WGS84, so we'll assign this to the rest of our data in a second.

crs = CRS("+init=epsg:4326")

Now we load the shape of Pakistan to complete the data we need.

shape_Pakistan = ne_countries(scale = 50, type = "countries", country = "Pakistan", returnclass = "sf")
shape_Pakistan = st_transform(shape_Pakistan, st_crs(crs)) #crs to shape



Mask data

If we take a look at our observation data, we see there are some observation points outside of Pakistan, so let's get rid of those.

butterflies_beforecutout = SpatialPointsDataFrame(cbind(butterflies_raw$x, butterflies_raw$y),  butterflies_raw, proj4string = crs)
plot(pred_stacked$vapr,  main="All observation data", xlab="Longitude", ylab="Latitude")
plot(butterflies_beforecutout, col = "blue", add = T, cex = 0.3, pch = 16 )

To be able to cut out the points, we convert the observation data to an sf object and then convert it back. This was first done in Random Forest III, I just stole it :D

#to cut out points: sf object
sf_butterfly = st_as_sf(butterflies_raw, coords = c("x", "y"), crs = crs) 

# cutting out the points outside of Pakistan
cutout_butterfly = st_intersection(sf_butterfly, shape_Pakistan) 

# returning the sf object to a data frame
cutout_butterfly_df = as.data.frame(cutout_butterfly) 

# getting the coordinates for the data frame as the command "xy = TRUE" in as.data.frame() didn't work here
coordinates_butterfly = sf::st_coordinates(cutout_butterfly)

# transform the matrix with the coordinates to a data frame
coords_df = as.data.frame(coordinates_butterfly) 
butterfly = cbind(cutout_butterfly_df$species, coords_df) # combine the cut out observation points with the coordinates

#add names again
names(butterfly) = c("species", "x", "y")

To plot the cut-out data we could either create a SpatialPointsDataFrame again, or use another method with ggplot.

#points after cutout 
ggplot() + geom_sf(fill = "white", data = shape_Pakistan$geometry) + geom_point(data = butterfly, size = 1, colour = "blue", aes(x = x, y = y)) +  
  labs(x = "Longitude", y = "Latitude", title = "Observation Points of Butterflies")

Now that we have gathered all the data, we can start modelling. The model creation will be demonstrated with one species. In this case I chose Aglais_caschmirensis (mainly because it was the first species with many observation points).

caschmirensis = butterfly[butterfly$species == "Aglais_caschmirensis",2:3]



Presence data

First, let's extract the predictor values at our observation points.

#extract data from predictors for species data
ext_casch = terra::extract(predictors, caschmirensis)
#see if it worked
head(ext_casch)
##   ID vapr wind  tmax  tmin t_mean  srad precip
## 1  1 0.49  1.4  10.6  -1.6    4.5 10234     37
## 2  2 0.08  1.8 -12.1 -18.8  -15.4  6923     60
## 3  3 0.20  1.5   2.9 -10.7   -3.9  7510     29
## 4  4 0.21  1.7   0.9 -12.1   -5.6  7715     26
## 5  5 0.16  2.1  -3.3 -15.0   -9.1  7607     28
## 6  6 0.06  2.2 -14.2 -19.2  -16.7  6755     62

We want to get rid of the ID column.

#get rid of ID column
ext_casch = subset(ext_casch, select = -ID)
#let's see if that worked
head(ext_casch)
##   vapr wind  tmax  tmin t_mean  srad precip
## 1 0.49  1.4  10.6  -1.6    4.5 10234     37
## 2 0.08  1.8 -12.1 -18.8  -15.4  6923     60
## 3 0.20  1.5   2.9 -10.7   -3.9  7510     29
## 4 0.21  1.7   0.9 -12.1   -5.6  7715     26
## 5 0.16  2.1  -3.3 -15.0   -9.1  7607     28
## 6 0.06  2.2 -14.2 -19.2  -16.7  6755     62
#worked out 



Pseudo-absence data

Like most other SDM methods in R, the BART model itself is run on a data frame of presence-absence or presence-pseudoabsence points and associated environmental covariates. So we first create random points all over Pakistan with randomPoints and then extract the data from our predictors at those points. Carlson strongly suggests creating an equal number of pseudo-absence and presence points for the best results.

#set seed for reproducible results
set.seed(3456)
#create random Points
absence_b = randomPoints(pred_stack_aggregate,nrow(ext_casch))
#extract data at those coordinates
abs.but = raster::extract(pred_stack_aggregate, absence_b)
#add Observed column (1) to presence data
ext_casch = data.frame(ext_casch); ext_casch$Observed <- 1
#add Observed column (0) to absence data
abs.but = data.frame(abs.but); abs.but$Observed <- 0
#bind them
all.b = rbind(ext_casch, abs.but)
#take a look 
head(all.b)
##   vapr wind  tmax  tmin t_mean  srad precip Observed
## 1 0.49  1.4  10.6  -1.6    4.5 10234     37        1
## 2 0.08  1.8 -12.1 -18.8  -15.4  6923     60        1
## 3 0.20  1.5   2.9 -10.7   -3.9  7510     29        1
## 4 0.21  1.7   0.9 -12.1   -5.6  7715     26        1
## 5 0.16  2.1  -3.3 -15.0   -9.1  7607     28        1
## 6 0.06  2.2 -14.2 -19.2  -16.7  6755     62        1



Check the presence-absence data

Just to be sure, let's plot the values to check.

plot(all.b$vapr, col =ifelse( all.b$Observed == "0" ,"red", "black"))


So that worked.

Bind coordinates to be able to plot spatially.

abs_pres = rbind(caschmirensis, absence_b)

Create a SpatialPointsDataFrame (spdf) and plot it spatially.

#one could again do this with an sf object if one wanted to
spdf_abs_pres = SpatialPointsDataFrame(abs_pres, all.b, proj4string = crs)
#plot the locations
plot(pred_stacked$vapr,  main="Absence (red) and presence (black) data", xlab="Longitude", ylab="Latitude")
plot(spdf_abs_pres, col =ifelse( spdf_abs_pres$Observed == "0" ,"red", "black"), add = T, cex = 0.3, pch = 16 )

#looks great



3. Model the data

Simple models

As a first step we define the vector of predictor names we will train the model on.

test_names = c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")

Now we train our first model! We do this with the bart function of embarcadero.

set.seed(4567)

#create model
sdm1 = bart(y.train=all.b[,"Observed"], #dependent variable
            x.train=all.b[,test_names], #explanatory variables
            keeptrees = TRUE) # very important: without TRUE, the predict function cannot run later
## 
## Running BART with binary y
## 
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 106
##  number of test observations: 0
##  number of explanatory variables: 7
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100) 
## (6: 100) (7: 100) 
## offsets:
##  reg : 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.739198
## 
## Tree sizes, last iteration:
## [1] 2 3 3 3 3 2 3 2 2 2 2 2 2 2 3 2 2 2 
## 2 3 2 2 2 2 2 3 1 2 2 3 2 2 2 2 1 2 4 2 
## 2 5 4 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2 4 2 
## 2 2 1 2 2 3 2 2 2 3 2 2 2 2 2 2 1 2 3 2 
## 4 2 2 2 2 3 3 3 3 2 2 2 4 2 2 3 2 2 1 3 
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 
## 2 2 2 3 3 2 2 1 3 2 3 2 2 3 2 2 2 3 2 2 
## 2 3 3 1 3 2 2 2 3 2 2 2 2 3 2 2 2 3 1 2 
## 2 2 2 3 3 2 2 1 3 2 1 2 3 2 2 2 2 2 3 2 
## 4 3 2 1 2 2 3 2 1 1 2 3 2 2 2 2 2 2 4 2 
## 2 2 
## 
## Variable Usage, last iteration (var:count):
## (1: 30) (2: 44) (3: 40) (4: 35) (5: 34) 
## (6: 32) (7: 28) 
## DONE BART
#predict map; one could add splitby = 10 here, minimizing memory conflicts and adding a progress bar
map1 = predict(sdm1, pred_stack_aggregate, quiet=TRUE)

Nice, this already gives us some info about the number of trees, the prior, the MCMC loop and even the variable usage.

How’s it look?

plot(map1, main='Predicted probability of Aglais caschmirensis', 
     box=F, axes=F)



Model analysis

To check how good our model is, we use the summary function. In this case, it provides us with loads of information.

summary(sdm1)
## Call:  bart all.b[, test_names] all.b[, "Observed"] TRUE 
##  
## Predictor list: 
##  vapr wind tmax tmin t_mean srad precip 
##  
## Area under the receiver-operator curve 
## AUC = 0.9654681 
##  
## Recommended threshold (maximizes true skill statistic) 
## Cutoff =  0.402945 
## TSS =  0.8490566 
## Resulting type I error rate:  0 
## Resulting type II error rate:  0.1509434


In the embarcadero package the summary function provides a graphical output and some printed information. Especially interesting are the AUC value, the classified fitted values and the recommended threshold. A high area under the receiver-operator curve and a clear visual split between the predicted probabilities assigned to the training presences and those assigned to the pseudo-absences indicate that the model has done an adequate job. The summary function also returns the threshold that maximizes the true skill statistic (TSS), and the sensitivity/specificity of the model at that cutoff (alpha).

In our case the model already performs quite well: our AUC is 0.965. We'll see if we can make it perform even better in a second. Also, there is a clear visual split in the classified values.

We can also create a threshold map. The summary function above provided us with the best cutoff, so we'll use that one.

plot(map1>0.4, main='Predicted distribution', 
     box=F, axes=F)



Variable reduction

Now let's take a look at the variable importance. We can plot the relative contribution of each variable depending on the number of trees.

set.seed(6789)
varimp.diag(all.b[,test_names], all.b[,'Observed'], iter=5, quiet=TRUE)
## 
##  10 tree models: 5 iterations
## 
##  20 tree models: 5 iterations
## 
##  50 tree models: 5 iterations
## 
##  100 tree models: 5 iterations
## 
##  150 tree models: 5 iterations
## 
##  200 tree models: 5 iterations

In general, as we increase the number of trees, the distribution of the relative importance tends to become flatter. With more trees we demand less predictive power from each tree, so less relevant features have a higher chance to be part of a given tree. Conversely, if we decrease the number of trees, we demand more from each single tree, which induces a more stringent competition between variables to be part of the trees; as a consequence, only the really important variables are included in the final trees. This is why later, in our final loop, we will use only 5 trees for the bart.step function. This also has a nice side effect: it keeps the computation manageable for the computer.

In these plots one sees that wind and tmin have the lowest variable importance.

There's also a tool for stepwise variable reduction called variable.step. When applied, it does the following:
1. Fit a full model with all predictors and a small tree ensemble (default m = 10), a fixed number of times (default n = 50);
2. Eliminate the least informative variable across all 50 runs;
3. Rerun the models minus the least informative variable (n = 50 times again), recording the root mean square error (RMSE) on the training data;
4. Repeat steps 2 and 3 until only three covariates are left;
5. Finally, select the model with the lowest average RMSE.

set.seed(4567)
step.model = variable.step(x.data = all.b[,test_names],
                            y.data = all.b[,'Observed'] )


Our root mean square error was lowest when 3 variables were dropped, so we drop wind, tmin and t_mean. When running this you can watch the variables being reduced step by step.

So we retrain the model with the reduced predictor set stored in step.model.

set.seed(4567)
sdm2 = bart(x.train =all.b[, step.model], y.train=all.b[,'Observed'],
            keeptrees = TRUE)
## 
## Running BART with binary y
## 
## number of trees: 200
## number of chains: 1, number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 106
##  number of test observations: 0
##  number of explanatory variables: 4
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) 
## offsets:
##  reg : 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 0.778987
## 
## Tree sizes, last iteration:
## [1] 2 2 2 1 3 2 2 3 2 2 2 2 3 2 3 2 3 3 
## 2 3 2 2 3 3 3 3 2 1 3 2 1 2 2 2 2 2 2 2 
## 2 2 2 3 3 3 1 1 2 1 2 2 2 3 3 4 2 2 2 2 
## 3 2 3 1 4 2 2 2 2 2 3 2 2 3 2 2 2 1 2 2 
## 2 2 2 3 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 
## 2 2 1 3 3 3 2 3 2 3 2 2 3 3 2 3 2 4 2 2 
## 3 3 2 3 3 3 2 3 3 3 2 2 2 4 2 3 2 2 3 1 
## 3 2 4 2 3 2 3 3 2 3 2 2 1 1 2 3 2 3 2 2 
## 2 2 2 2 2 3 2 2 2 1 3 3 2 2 2 2 3 2 2 2 
## 3 3 2 2 3 2 2 2 2 1 2 2 2 2 4 2 2 2 3 2 
## 2 4 
## 
## Variable Usage, last iteration (var:count):
## (1: 46) (2: 75) (3: 71) (4: 67) 
## DONE BART

And predict the species distribution again.

set.seed(4567)
map2 = predict(sdm2, pred_stack_aggregate, quiet=TRUE)

We can now compare the models.

summary(sdm2)
## Call:  bart all.b[, step.model] all.b[, "Observed"] TRUE 
##  
## Predictor list: 
##  vapr tmax srad precip 
##  
## Area under the receiver-operator curve 
## AUC = 0.9661801 
##  
## Recommended threshold (maximizes true skill statistic) 
## Cutoff =  0.3696936 
## TSS =  0.8490566 
## Resulting type I error rate:  0 
## Resulting type II error rate:  0.1509434


We see that now only four predictors were used, along with some other small changes. The AUC is now 0.966, just a bit higher than before.

Let's compare the threshold maps.

def_par = par(no.readonly = TRUE) #store the default settings so we can restore them later; alternatively use dev.off()
par(mfrow=c(1,2))
plot(map1>0.4, main='Predicted distribution before', 
     box=F, axes=F)
plot(map2>0.37, main='Predicted distribution now', 
     box=F, axes=F)


We do see small differences.

The stepwise variable reduction which we performed in multiple steps can also be done all at once with bart.step. The function does the following:
1. The variable importance diagnostic from varimp.diag is generated.
2. The stepwise variable set reduction in variable.step is run.
3. The final model is run with the reduced predictor set.
4. The variable importance plot from varimp is generated.
5. The model summary from summary is returned.
So when creating our species richness map, we'll be using this function.

bartstep = bart.step(y.data =all.b[,'Observed'],
     x.data =all.b[,test_names])

The package embarcadero has one last very nice feature. With the function spartial we can project partial dependence surfaces into the real, geography-having world, so we can see the spatial signal of any given variable's contribution to the model predictions.

set.seed(4567)
plot(spartial(sdm2, pred_stack_aggregate, x.vars='precip',
         equal=TRUE))#, smooth=5)



4. Looping

Prepare Data

To create our desired species richness map, we need to repeat the process we used for Aglais caschmirensis for all of our species. Since there are species with only one or two observations, not every model would be accurate. BARTs perform quite well with little data, so in this case we take all species with at least five observation records. With the levels function we get our species names.

butterfly$species = as.factor(butterfly$species)
species = levels(butterfly$species)
species_5 = "Aglais_caschmirensis" #create so we have a base to add to

Then we use a loop (which I again snatched from Leon's tutorial Random Forest III :D) that checks how many observation points a species has and, if there are at least five, adds the species to species_5.

for (i in species) {
  sum_butterfly = sum(butterfly$species == i)
  if(sum_butterfly > 4){ species_5 <- append(species_5, subset(species, species == i))}
}

Now we remove Aglais caschmirensis from species_5 because we already added it.

species_5 = species_5[-1]
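
As an aside, the loop plus the removal step can also be written as a single line of base R using table(); species_5_alt is just an illustrative name for the result:

#equivalent one-liner: keep all species with at least 5 observation records
species_5_alt = names(which(table(butterfly$species) >= 5))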



Our final loop

Now we can finally run our loop. As explained earlier, we fit our model with the bart function. To select and use only the most important variables, we could create our step model with variable.step and then use it to fit a new SDM; in this case it's easier to use the automated bart.step function. Within the loop we use lists to store our results, which also lets us check in hindsight whether all steps worked the way we wanted. Since our final goal is a threshold map, we will use the output of the bart.step function with a fixed threshold of 0.5. Unfortunately, I could not find a way to include the threshold proposed by the summary function in the loop. There are ways to convert output data into a usable form and include it in a loop, but not here, because the summary function gives us an unusual output (you can see this by calling the attributes function on the summary). This unfortunately prevents us from using the threshold or AUC values inside the loop; they can still be inspected for single species, as I will show later.
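
That said, here is a possible workaround I did not end up using. dbarts stores the posterior draws for the training data in yhat.train on the probit scale, so one can recompute the posterior mean probabilities and search a grid of cutoffs for the TSS-maximizing one, mimicking what summary prints. Treat this as an untested sketch; get_tss_cutoff is a hypothetical helper, not part of embarcadero:

#sketch: recover the TSS-maximizing cutoff from a fitted binary BART model
#(assumes, as in dbarts, that $yhat.train holds posterior draws on the probit scale)
get_tss_cutoff = function(model, observed) {
  prob = colMeans(pnorm(model$yhat.train)) #posterior mean P(presence) per training point
  cutoffs = seq(0.01, 0.99, by = 0.01)
  tss = sapply(cutoffs, function(cut) {
    pred = as.numeric(prob > cut)
    sens = sum(pred == 1 & observed == 1) / sum(observed == 1) #sensitivity
    spec = sum(pred == 0 & observed == 0) / sum(observed == 0) #specificity
    sens + spec - 1 #true skill statistic
  })
  cutoffs[which.max(tss)]
}

get_tss_cutoff(sdm2, all.b$Observed) #should land close to the cutoff summary() printed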

If you can't wait too long, the normal model should be enough; if you have a couple of hours to spare, use the step model. Creating lists such as list_extr_all is useful if you want to check afterwards whether everything worked.

#create lists
list_extr_all = list()
list_sdms = list()
list_sdms_step = list()
list_maps = list()
list_maps_step = list()
list_maps_treshold = list()

#loop
for (i in species_5) {
  
  
  #extract coords from data 
  coords_but_all = butterfly[butterfly$species == i,2:3]
  
  #use these coords to extract the data of predictors
  ext_pres_all = terra::extract(predictors, coords_but_all)

  #get rid of ID column
  obs_pres_all = subset(ext_pres_all, select = -ID)
  
  #add Observed column (1) to presence data
  obs_pres_all = data.frame(obs_pres_all); obs_pres_all$Observed <- 1

  #now create the absence data
  set.seed(4567)
  absence_b_all = randomPoints(pred_stack_aggregate, nrow(ext_pres_all))
  #use raster::extract here, as above, since pred_stack_aggregate is a RasterStack
  abs_but_all = raster::extract(pred_stack_aggregate, absence_b_all)
 
  #add Observed column (0) to absence data
  abs_but_all = data.frame(abs_but_all); abs_but_all$Observed <- 0
  
  #bind so we can create our model from it
  all_b_extr = rbind(obs_pres_all, abs_but_all)
  
  #add the combined data frame to the list
  list_extr_all[[i]] = all_b_extr

  #create names for test
  test_names = c("vapr", "wind", "tmax", "tmin", "t_mean", "srad", "precip")
  
  set.seed(4567)
  #create the normal model so we can later compare it to the step model; it also runs a lot faster
  sdm_all = bart(y.train=all_b_extr[,"Observed"],
                 x.train=all_b_extr[,test_names],
                 keeptrees = TRUE, # very important: without TRUE, predict() cannot run later
                 ntree = 5,
                 nskip = 0)
  
  #create step model, this takes a couple of hours
  set.seed(4567)
  sdm_all_step = bart.step(y.data =all_b_extr[,'Observed'],
                       x.data =all_b_extr[,test_names],
                       tree.step = 5) 
  
  #add step model to list, useful in case you want to check the summary later on
  list_sdms_step [[i]] = sdm_all_step
  
  #add model to list
  list_sdms[[i]] = sdm_all
  
  #predict maps from the models and the predictors
  set.seed(4567)
  map_all_step = predict(sdm_all_step, pred_stack_aggregate, splitby=10)
  map_all = predict(sdm_all, pred_stack_aggregate, splitby=10)
  
  #add the maps to their lists
  list_maps_step[[i]] = map_all_step
  list_maps[[i]] = map_all
 
  #create threshold maps from the step-model maps
  treshold_map = map_all_step > 0.5
  
  #add them to list
  list_maps_treshold[[i]] = treshold_map
  
  #print which species we just processed
  print(paste("Element", i, "has been processed."))
  
}  

We now stack our maps and add names. We also stack the probability maps, so we can plot them per species below.

#stack the threshold maps and the probability maps
maps_stacked_step_tresh = stack(list_maps_treshold)
maps_stacked_step = stack(list_maps_step)

#add names 
names(maps_stacked_step_tresh) = c(species_5)
names(maps_stacked_step) = c(species_5)

If one wanted, one could now check out a separate map for every species, either as a threshold map or as a probability map.

#eg our species: caschmirensis
par(mfrow=c(1,2))
plot(maps_stacked_step$Aglais_caschmirensis, xlab = "Longitude", ylab = "Latitude",main="probability map")
plot(maps_stacked_step_tresh$Aglais_caschmirensis, xlab = "Longitude",main="threshold map")

We could also look at the AUC or threshold values of every single model using our list of SDMs.

summary(list_sdms_step$Aglais_caschmirensis)
## Call:  bart train[, 2:ncol(train)] train[, 1] n.trees TRUE 
##  
## Predictor list: 
##  vapr tmax srad precip 
##  
## Area under the receiver-operator curve 
## AUC = 0.9658241 
##  
## Recommended threshold (maximizes true skill statistic) 
## Cutoff =  0.3693926 
## TSS =  0.8490566 
## Resulting type I error rate:  0 
## Resulting type II error rate:  0.1509434


We could then use the given cutoff of 0.37 to create an even better performing threshold map for each species.
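
For example, applying that tuned cutoff to the probability map we stored in the loop (a short sketch using the objects created above):

#threshold map for one species using its tuned cutoff instead of 0.5
plot(list_maps_step$Aglais_caschmirensis > 0.37,
     main = "Aglais caschmirensis, cutoff = 0.37", box = FALSE, axes = FALSE)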

Our species richness map

We want to add up all the species into one richness map though. So here is our final result.

richness_map_step_tresh = stackApply(maps_stacked_step_tresh,indices=c(1),fun=sum, na.rm=TRUE)

par(def_par)
plot(richness_map_step_tresh, xlab = "Longitude", ylab = "Latitude",main="Species Richness Map")

Looks good. For a possibly more accurate result, more environmental data could be added (elevation, land cover, …).



5. Sources

Bayesian Additive Regression Trees — Bayesian Modeling and Computation in Python (2022). Available at https://bayesiancomputationbook.com/markdown/chp_07.html. Last checked 04.07.2022.
Carlson, C.J. (2020): embarcadero: Species distribution modelling with Bayesian additive regression trees in R. Methods in Ecology and Evolution 11 (7), pp. 850–858. Last checked 04.07.2022.
Carlson, C.J. (2020): cjcarlson/pier39: ME&E paper release - advanced vignette. Zenodo. Last checked 04.07.2022.
Jost, Z. (2017): "Bayesian Additive Regression Trees" paper summary. Towards Data Science, 07.03.2017. Available at https://towardsdatascience.com/bayesian-additive-regression-trees-paper-summary-9da19708fa71, last checked 04.07.2022.
Mehta, S. (2022): A beginner's guide to Bayesian Additive Regression Trees. Analytics India Magazine, 27.03.2022. Available at https://analyticsindiamag.com/a-beginners-guide-to-bayesian-additive-regression-trees/, last checked 04.07.2022.



Session Info

sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] embarcadero_1.2.0.1003 data.table_1.14.2      matrixStats_0.62.0    
##  [4] ggpubr_0.4.0           patchwork_1.1.1        ROCR_1.0-11           
##  [7] dismo_1.3-5            forcats_0.5.1          stringr_1.4.0         
## [10] dplyr_1.0.9            purrr_0.3.4            readr_2.1.2           
## [13] tidyr_1.2.0            tibble_3.1.7           ggplot2_3.3.6         
## [16] tidyverse_1.3.1        Metrics_0.1.4          dbarts_0.9-22         
## [19] raster_3.5-15          sp_1.5-0               sf_1.0-7              
## [22] terra_1.5-21           rnaturalearth_0.1.0    rmdformats_1.0.4      
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2                lubridate_1.8.0         httr_1.4.3             
##  [4] tools_4.2.0             backports_1.4.1         bslib_0.3.1            
##  [7] rgdal_1.5-32            utf8_1.2.2              R6_2.5.1               
## [10] KernSmooth_2.23-20      DBI_1.1.3               colorspace_2.0-3       
## [13] withr_2.5.0             rnaturalearthdata_0.1.0 tidyselect_1.1.2       
## [16] compiler_4.2.0          cli_3.3.0               rvest_1.0.2            
## [19] xml2_1.3.3              labeling_0.4.2          bookdown_0.27          
## [22] sass_0.4.1              scales_1.2.0            classInt_0.4-7         
## [25] proxy_0.4-27            digest_0.6.29           rmarkdown_2.14         
## [28] pkgconfig_2.0.3         htmltools_0.5.2         highr_0.9              
## [31] dbplyr_2.2.0            fastmap_1.1.0           rlang_1.0.2            
## [34] readxl_1.4.0            rstudioapi_0.13         farver_2.1.0           
## [37] jquerylib_0.1.4         generics_0.1.2          jsonlite_1.8.0         
## [40] car_3.1-0               magrittr_2.0.3          s2_1.0.7               
## [43] Rcpp_1.0.8.3            munsell_0.5.0           fansi_1.0.3            
## [46] abind_1.4-5             lifecycle_1.0.1         stringi_1.7.6          
## [49] yaml_2.3.5              carData_3.0-5           plyr_1.8.7             
## [52] grid_4.2.0              parallel_4.2.0          crayon_1.5.1           
## [55] lattice_0.20-45         haven_2.5.0             hms_1.1.1              
## [58] knitr_1.39              pillar_1.7.0            ggsignif_0.6.3         
## [61] codetools_0.2-18        wk_0.6.0                reprex_2.0.1           
## [64] glue_1.6.2              evaluate_0.15           modelr_0.1.8           
## [67] vctrs_0.4.1             tzdb_0.3.0              cellranger_1.1.0       
## [70] gtable_0.3.0            reshape_0.8.9           assertthat_0.2.1       
## [73] xfun_0.31               broom_0.8.0             e1071_1.7-11           
## [76] rstatix_0.7.0           class_7.3-20            units_0.8-0            
## [79] ellipsis_0.3.2