Introduction

The Boosted Regression Trees (BRT) method is based on the concepts and algorithms of decision trees and boosting. In this model, predictions are made through a sequence of decisions on features or data, represented as “trees”. The individual terms of the model, each represented by a single tree, are fitted gradually in a forward, stagewise fashion, building up an ensemble of decisions.

It is important to mention that the BRT method has several advantages. In addition to accommodating many different types of variables, it also takes missing data into account. Likewise, BRT models are robust to outliers and can capture complex nonlinear relationships in the data.

Overall, the concept of Boosted Regression Trees offers good predictive performance in the ecological domain (Elith et al. 2008: 802).

Theory

Regression Trees

Regression trees are a very popular method because the visualization of parameters, variables and decisions is easy to understand. Likewise, decision trees can use any type of predictor, such as numeric, binary or categorical variables (Elith et al. 2008: 803).

Regression Tree Structure, JavaTpoint.com

As mentioned above, regression trees are less prone to data outliers and can replace missing data with surrogates. The hierarchical structure of the tree, from top to bottom, is also the decision and reading direction: each lower split depends on the input variables above it (Elith et al. 2008: 803). The data used to build each tree are randomly drawn from the entire data set and always have the same size. After a tree is built, the drawn data are returned to the data set and can be drawn again for further trees (BCCVL, 02.06.2023).
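To illustrate this resampling idea, here is a minimal sketch in R (purely illustrative with a made-up toy data frame; it is not part of the tutorial's workflow): each tree draws a random subset of fixed size from the full data set, and because every draw starts again from the full data set, the same rows can reappear in later trees.

set.seed(42)
full_data <- data.frame(x = runif(100), y = runif(100))   # toy data set

for (tree_i in 1:3) {
  # each tree gets a random subset of the same size; the rows are "returned"
  # afterwards, so later trees can draw them again
  subset_i <- full_data[sample(nrow(full_data), size = 50), ]
  # ... a regression tree would be fitted on subset_i here ...
}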

A disadvantage of trees is that the tree structure changes when the training data change. If different data are used in model development, different split sequences and structures must be expected. This disadvantage leads to uncertainties in the predictive power and the subsequent interpretation of the modeled species distribution (Elith et al. 2008: 804).

Boosting

“Boosting is a method of improving model accuracy based on the idea that it is easier to find and average many rough rules of thumb than to find a single, highly accurate predictive rule” (Elith et al. 2008: 804).

There are many different boosting algorithms, which can be used for specific purposes and data. In addition to their specific tuning parameters, they also differ in how each iteration is set up. Overall, boosting can be described as an optimization method used to minimize a loss function: in each boosting step, a new tree is added that achieves the best possible reduction of the error remaining from the previous steps (Elith et al. 2008: 804).
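As a minimal sketch of this stagewise idea, the following toy example (our own simplification using squared-error loss and shallow rpart trees, not the exact algorithm inside the gbm package) fits each new tree to the residuals of the current model and adds only a small fraction of its prediction:

library(rpart)

set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- sin(df$x) + rnorm(200, sd = 0.3)   # toy data

learning_rate <- 0.1
n_trees <- 100
prediction <- rep(mean(df$y), nrow(df))    # start from the mean

for (m in 1:n_trees) {
  df$residual <- df$y - prediction         # errors of the current model
  tree_m <- rpart(residual ~ x, data = df,
                  control = rpart.control(maxdepth = 2))  # shallow tree
  # each new tree corrects a small fraction of the remaining error
  prediction <- prediction + learning_rate * predict(tree_m, df)
}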

Combination/Advantages

Combining boosting with regression trees yields an improved and more controlled way of building tree ensembles. The probabilities and predictions become more accurate, which can be attributed to the typical stepwise model generation.

In summary, there are three important advantages of the BRT method. First, the entire process is stochastic and contains random components, which improves prediction performance. Second, as previously mentioned, the model is built stepwise, so after each step the model is adjusted to account for the trees fitted so far. Last, the tuning parameters are easy to adjust, and the two parameters Learning Rate and Tree Complexity largely determine the number of trees required for an optimal fit to the data set (Elith et al. 2008: 804).

Data & Method

For the following tutorial we use a data set of butterflies found in Pakistan, taken from the course Ilias folder. The CSV file contains several thousand occurrence points of various species.

Example of butterfly species: Vanessa cardui, Wikipedia.com

The environmental layers for Pakistan are taken from the geodata package and consist of the Bioclim variables; the package also provides the national boundary, which is needed for cropping. The variables and data are thus included directly in the implementation process in RStudio through the packages.

The method used for the final model is the gbm.step function from the package dismo. It can be configured by the user through the parameters Tree Complexity and Learning Rate. While the Tree Complexity controls the number of splits in each tree, the Learning Rate defines the contribution of each tree to the whole model. Consequently, the lower the value of the Learning Rate, the more trees are created and the smaller the contribution of each individual tree (BCCVL, 03.06.2023).

Tree Complexity and Learning Rate, bccvl.org
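As a hedged illustration of this trade-off (assuming a hypothetical data frame "my_data" with predictors in columns 2:20 and a 0/1 response in column 21; this is not run in the tutorial), one could fit the same model with two learning rates and compare how many trees the internal cross-validation selects:

# hypothetical data frame "my_data"; a smaller learning rate typically
# lets gbm.step grow more trees before the CV deviance stops improving
model_fast <- dismo::gbm.step(data = my_data, gbm.x = 2:20, gbm.y = 21,
                              family = "bernoulli",
                              tree.complexity = 4, learning.rate = 0.05)
model_slow <- dismo::gbm.step(data = my_data, gbm.x = 2:20, gbm.y = 21,
                              family = "bernoulli",
                              tree.complexity = 4, learning.rate = 0.005)

# number of trees selected by gbm.step's internal cross-validation
model_fast$gbm.call$best.trees
model_slow$gbm.call$best.trees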

Implementation in RStudio

Getting started

Before we begin working with the data, we have to install and load the packages used in this tutorial and set our working directory.

library(terra)
library(dismo)
library(gbm)
library(dplyr)
library(sp)
library(raster)
setwd("D:/SDM/sdm-2023-Remmette/Project/Input/")

Loading data

As a first step we have to load the different datasets in RStudio to be able to work with them:

First we load the distribution data of all butterfly sightings in Pakistan. Since it is a .csv file, we read it from our directory.

distribution_data <- read.csv("PakistanLadakh.csv")

Afterwards we load the Pakistani border as well as the climate data directly through the package “geodata” in RStudio and save them in our directory. The function for the administrative boundaries is “gadm” and for the climate data by country “worldclim_country”.

pak_border <- geodata::gadm(country="Pakistan", path="D:/SDM/sdm-2023-Remmette/Project/Input/")

env_layers <- geodata::worldclim_country(country="PAK", res=10, var="bio", path="D:/SDM/sdm-2023-Remmette/Project/Input/", version = "2.1")

Data Processing

In the next step we will process the data to bring it into the right format and prepare it for model building.

Environmental layers

The climate data we just saved in our environmental layers get renamed by shortening each name with the function “substr”, keeping only the 11th to 20th characters. A description of all bioclim variables can be found at the WorldClim link in the references.

names(env_layers) <- substr(names(env_layers),11,20)

names(env_layers)
##  [1] "bio_1"  "bio_2"  "bio_3"  "bio_4"  "bio_5"  "bio_6"  "bio_7"  "bio_8" 
##  [9] "bio_9"  "bio_10" "bio_11" "bio_12" "bio_13" "bio_14" "bio_15" "bio_16"
## [17] "bio_17" "bio_18" "bio_19"

Furthermore we will clip the environmental layers to the border we loaded beforehand. This is done with the “mask” function, which sets all cells outside the border to NA.

env_layers <- terra::mask(env_layers, pak_border)

Species data

For the species data (distribution_data) we start off by renaming the columns to bring a bit more clarity into the data set.

colnames(distribution_data) <- c("species", "x", "y")

Afterwards we trim the data by keeping only species with more than 40 occurrences, since species with too few data points are less reliable to model, and the filtering also speeds up processing. For this we group the data by species and then filter with n() > 40.

filtered_species <- distribution_data %>% group_by(species) %>% filter(n() > 40)

For the next steps the data type of our species data has to be changed from a data frame to an sf object, turning it into spatial data.

filtered_species <- sf::st_as_sf(filtered_species, coords=c("x", "y"), remove=F, crs=sf::st_crs("epsg:4326"))

To connect the occurrences with the corresponding climate data at each point, we combine the species data with the environmental layers by extracting the “env_layers” data and adding it to “filtered_species” via the function “cbind”.

filtered_species_extr <- terra::extract(env_layers,filtered_species)
filtered_species <- cbind(filtered_species, filtered_species_extr);rm(filtered_species_extr)

Another step is adding a new column to our filtered species set in which we give every point a binary occurrence value. In this case every occurrence gets a 1.

filtered_species$occ <- 1

As is standard in modelling workflows, we split our newly combined data set into training and test data. We decided on an 85/15 ratio.

Before we can split the data set in two we have to initialize a custom function which lets us filter for data that is not in a specific data set.

`%not_in%`<- Negate(`%in%`)

Once that is done, we first use the function “slice_sample” on our “filtered_species” and enter the desired proportion. The “group_by” part groups by species before slicing the sample, which ensures that every species is split at an 85/15 ratio. Then we filter for all the IDs that are not in the newly created subset, using the custom function we just initialized. Both functions “slice_sample” and “filter” are part of the “dplyr” package.

test_data <- filtered_species %>% group_by(species) %>% slice_sample(prop=.15)
train_data <- filter(filtered_species, ID %not_in% test_data$ID )

Background data

More or less the same has to be done with the background data. The background data are randomly placed vector points that represent the absence of the species. For this model we create as many background points as there are species points. They are directly masked to the border and transformed into an sf object, just like the species data before.

background_points <- sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(env_layers),
                                                                    n=nrow(filtered_species))),
                                  crs=terra::crs(env_layers), coords=c("x","y"), remove=F)

Furthermore we add the environmental data and a column “species” to match the “filtered_species” data set, and add an occurrence column, this time with all values being 0 for absence.

bgp_extr <- terra::extract(env_layers, background_points)
background_points <- cbind(background_points,bgp_extr);rm(bgp_extr)

species <- c("background")
background_points <- cbind(species,background_points)

background_points$occ <- 0

Preparations for the for-loop

Before we can begin our calculations and predictions, we have to prepare some support for the loop in which we combine presence and background points, and build and predict the models. For this we create a temporary copy of all background points, which we reduce with every iteration.

bp_base_temp <- background_points

Model building

With this we can start building our for-loop. As mentioned before, we start off by combining the training data (train_data) and background points. Since we do this for each species, we filter the training data, take a portion of the background points and bind them together with the function “rbind”. If we want to run a test, we just have to swap the training data with the test data.

After all this, we can now start with the main part: the model building. To create our model with the Boosted Regression Tree method we use the “gbm.step” function. Since we build a model and predict the occurrence for every species separately, we write a loop iterating over every unique species. In every iteration we first create a temporary data set including the species’ presences and a share of the background points. Afterwards we use this temporary data set to train our model. Here we can tune the model training with different parameters:

gbm.x = the variables with which the model should be trained (the bioclim layers)
gbm.y = the column indicating occurrence or absence (1/0)
tree.complexity = how complex the single trees are
learning.rate = the contribution of a single tree to the model

For every model that gets calculated there is an output including a graphic. In this loop we disabled the plot with “plot.main = FALSE”.

Further parameters we did not use in this model are site.weights, max.trees and step.size.

Once the model is fitted we save it as an RDS file for every species.

Still in the same iteration, we use the function “predict” from the “terra” package to predict the possible occurrence of the species, with the trained model and the “env_layers” data set as a basis.

When the possible occurrences are predicted, we set a threshold to classify them into presence and absence. In our example we use the midpoint between the minimum and maximum predicted values as the threshold. The classification is done with an “ifelse” clause and the new values are written into the prediction raster.

At the end the classified raster is saved to a file showing the predicted presence (1) and absence (0) of the species, leaving us with one file per species.

for (i in unique(train_data$species)){

  # presence points of the current species
  train_temp <- train_data %>% dplyr::filter(species %in% i) 
  # draw as many background points as there are presences and remove them
  # from the pool so they are not reused for other species
  background_temp <- slice_sample(bp_base_temp, n=nrow(train_temp))
  bp_base_temp <- filter(bp_base_temp, ID %not_in% background_temp$ID)
  
  # combine presences and background points and drop the geometry column
  data_temp <- rbind(train_temp,background_temp)%>%as.data.frame()%>%dplyr::select(-geometry) 

  # fit the boosted regression tree model for the current species
  brt_model <- gbm.step(data = data_temp,
                      gbm.x = 5:23,
                      gbm.y = 24,
                      family = "bernoulli",
                      tree.complexity = 4,
                      bag.fraction = 0.5,
                      learning.rate = 0.005,
                      plot.main = FALSE
                      #site.weights = w,
                      #max.trees = 1000,
                      #step.size = 100
                      )

  saveRDS(brt_model, paste0("D:/SDM/sdm-2023-Remmette/Project/Output/Model/BRT_Model_",i,".RDS"))

  # predict the possible occurrence across the environmental layers
  pred <- terra::predict(object = env_layers, model= brt_model, na.rm=T)

  # threshold: midpoint between minimum and maximum predicted values
  max_value <- cellStats(raster(pred), "max")
  min_value <- cellStats(raster(pred), "min")
  threshold <- (max_value+min_value)/2

  # classify each cell into presence (1) or absence (0)
  pred_class <- ifelse(values(pred) > threshold, 1, 0)

  pred_class_raster <- raster(pred)
  values(pred_class_raster) <- pred_class

  # save the classified prediction map for the current species
  writeRaster(pred_class_raster,paste0("D:/SDM/sdm-2023-Remmette/Project/Output/Pred_Map/BRT_Map_",i,".tif"), overwrite=T)
}

Here is an example of a model fitting output (plot.main = TRUE):

## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for occ and using a family of bernoulli 
## Using 108 observations and 19 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.3863 
## tolerance is fixed at  0.0014 
## ntrees resid. dev. 
## 50    1.2708 
## now adding trees... 
## 100   1.198 
## 150   1.1542 
## 200   1.1332 
## 250   1.1155 
## 300   1.1098 
## 350   1.1124 
## 400   1.1155 
## 450   1.127 
## 500   1.1369 
## 550   1.1485 
## 600   1.1559 
## 650   1.1711 
## 700   1.182 
## 750   1.1901 
## 800   1.1968 
## 850   1.2048 
## 900   1.2128 
## 950   1.2221 
## 1000   1.2326
## fitting final gbm model with a fixed number of 300 trees for occ

## 
## mean total deviance = 1.386 
## mean residual deviance = 0.828 
##  
## estimated cv deviance = 1.11 ; se = 0.081 
##  
## training data correlation = 0.752 
## cv correlation =  0.52 ; se = 0.076 
##  
## training data AUC score = 0.936 
## cv AUC score = 0.786 ; se = 0.044 
##  
## elapsed time -  0.03 minutes

Species Richness Map

At the very end we want to overlay all predicted species occurrences and add their values together, pixel by pixel, to create a species richness map. To do this we load all the files we created before from our directory into a stack. Then we use the “calc” function to compute the sum for every pixel, which gives the number of predicted species per pixel.

class_stack <- stack(list.files(path="D:/SDM/sdm-2023-Remmette/Project/Output/Pred_Map/", pattern='.tif$', full.names=TRUE))

class_sum <- calc(class_stack, sum)

After that our species richness map is done!

References

Elith, J., Leathwick, J.R. & Hastie, T. (2008): A working guide to boosted regression trees. Journal of Animal Ecology 77: 802-813. https://doi.org/10.1111/j.1365-2656.2008.01390.x

BCCVL: Boosted Regression Tree. https://support.bccvl.org.au/support/solutions/articles/6000083202-boosted-regression-tree (accessed 02.06.2023)

JavaTpoint: Machine Learning Decision Tree Classification Algorithm. https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm (accessed 13.06.2023)

WorldClim: Bioclimatic variables. https://worldclim.org/data/bioclim.html

SessionInfo

R environment session info for reproducibility of results:

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.2   gbm_2.1.8.1   dismo_1.3-9   raster_3.6-20 sp_1.6-0     
## [6] terra_1.7-29 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0   xfun_0.38          bslib_0.4.2        sf_1.0-12         
##  [5] splines_4.2.1      lattice_0.20-45    vctrs_0.6.1        generics_0.1.3    
##  [9] geodata_0.5-8      htmltools_0.5.5    yaml_2.3.7         utf8_1.2.3        
## [13] survival_3.3-1     rlang_1.1.0        e1071_1.7-13       jquerylib_0.1.4   
## [17] pillar_1.9.0       withr_2.5.0        glue_1.6.2         DBI_1.1.3         
## [21] lifecycle_1.0.3    codetools_0.2-18   evaluate_0.20      knitr_1.42        
## [25] fastmap_1.1.1      class_7.3-20       fansi_1.0.4        highr_0.10        
## [29] Rcpp_1.0.10        KernSmooth_2.23-20 classInt_0.4-9     cachem_1.0.7      
## [33] jsonlite_1.8.4     digest_0.6.31      grid_4.2.1         rgdal_1.6-5       
## [37] cli_3.6.1          tools_4.2.1        magrittr_2.0.3     sass_0.4.5        
## [41] proxy_0.4-27       tibble_3.2.1       pkgconfig_2.0.3    Matrix_1.4-1      
## [45] rmarkdown_2.21     rstudioapi_0.14    R6_2.5.1           units_0.8-2       
## [49] compiler_4.2.1