1. Theory of the Random Forest

1.1 Basics of random forest: Classification trees

At the heart of every random forest lies the decision tree (Probst 2019, S. 4). The goal of every decision tree is to predict the outcome variable based on a number of input variables (ebd.). For our case these are the pixel values of the prediction raster. The tree consists of nodes, which represent a variable and leaf’s, which connect the nodes to each other (Genuer & Poggi 2020, S. 9). The principal of the tree is to split the outcomes based on prediction variables in the nodes (ebd.). For example, a hypothetical node in a tree could have the prediction variable average temperature and would split the outcomes at a value of 12. The method is a very easy and clear method of statistical decision but it has some downsides on its own (Probst 2019, S. 4). The most common problem connected with single decision trees is the overfitting of the training data (ebd.). This means that the predictions fit well with the given training data, but with any other dataset the outcome is far worst (ebd.). This shortcoming led to the development of the random forest by Breiman in 2001 (Breiman 2001, S. 5).

1.2 Random forest

The random forest is based on decision trees but it adds a little twist (Ramosaj 2020, S. 24). On one side not only one tree is used thus the name random forest literally meaning a forest of decision trees (ebd.). This alone greatly improves the performance of the prediction (ebd.). The other part that makes the random forest such a good machine learning approach is its randomness (ebd.). For every tree in the forest the subset of data is randomly selected, commonly referred to as bootstrapping (ebd.). To further improve performance Breiman implemented a random selection of the prediction variables at every spit (node) (ebd.). The best on of this subset is than determined the same way as with a normal decision tree using the Gini-index (Genuer & Poggi 2020, S. 9). The following Hyperparameters are important for the tree construction and have to be considered:

  • Number of decision trees to be generated
  • Sampling strategy (subsampling or bootstrapping for example)
  • Number of selected features for splits
  • Number of observations for each tree
  • Number of leaves in each tree

This chapter was only a very brief overview over the thematic, for more in-depth information the book from Breiman can be recommended. Click here

For a more practical explanation the following video with R sample code might be interesting. Click here

2. Information about the dataset

2.1 Occurence Data

The dataset is comprised of the occurrence data from 429 species of butterfly from Pakistan. The format used contains a table with three rows labeled “LONGITUDE”, “LATITUDE” and “SPECIES”. The first two contain the coordinated at which there was an observation of the species. The special information is stored with geographic coordinates in the WGS84 reference system. In the “SPECIES” column are the individual species names of the butterflies. The Type of data here is often described as presence data, in opposition to absence data. Absence data is composed of points where the species is known to not exist. Our Package can handle both data types, but only the presence data is used in this tutorial.

2.2 Prediction Data

Our prediction dataset is composed out of five different raster layers of Pakistan. The data was provided by the course instructor. The rasters are datasets from Worldclim which are already cropped to the extent of Pakistan. The data can be downloaded from this website Click here. The Variables used for this prediction are all average values from the Climatological normal of 1970-2000. The raster all have a resolution of 30 arcsecond (1 by 1 km cells). The set originally contained two more Raster (minimum and maximum Temp) but these were deleted beforehand because of very high multicollinearity to the average temperature. Important for the used package (SSDM) is that the raster layer all have the same extend.

3. Setting up the script

3.1 Loading the packages

As the first step in our R Script, we need to load all the libraries for the Task. Especially for the SSDM-Package it is important to run the command installAll() the first time we use the package to install all the required packages. / For our Task we need several Packages:

The most important is ssdm. It is used to create our random forest prediction model. The package is very easy to use and specifically developed for species distribution modeling. Alongside Random Forest it offers many other prediction methods. It also offers the possibility to compute several models for many species at once, but this functionality requires a very powerful computer so it is just shown here.

Other packages required are raster for the handling of all raster objects and rgdal for similar purposes.

library(SSDM)
library(raster)
library(rgdal)

3.2 Loading the climate data

First we define our folder structure. The folder climate contains the raster data of the prediction variables. In our case these are sun radiation, wind speed, water vapor pressure, precipitation and yearly average temperature. The five raster layer are packed into a raster stack.

home<-"/Users/leanderheyer/Desktop/SDM/"

climate<-paste0(home,"Climat_data/")

occ<-paste0(home,"occ/")

We use the load_var function of the ssdm package to load our predictor variables.

pred<- load_var(path=climate)
## Variables loading 
## Variables treatment 
##    resolution and extent adaptation...   done... 
##    normalizing continuous variable

The raster stack now looks like this and contains the mentioned layers.

plot(pred)

3.3 Loading the species data


In the next step the point data of the Butterfly in Pakistan is loaded in with the help of another function from the SSDM package: load_occ. The data is inside the occ Folder. The function gives many options, so in our case we define all the columns we need for the modeling like the spacial coordinates and the SPECIES column. The function also uses the existing raster-layers to normalize the data, cutoff points outside of the prediction values and also delete SPECIES which have too few measurements to make a meaningful prediction.

OCC <- load_occ(path=occ,pred ,Xcol = 'LONGITUDE', Ycol = 'LATITUDE',Spcol = 'SPECIES',
                file = 'butterfly.csv', sep = ',')


3.4 Loading additional data

Next the Administrative Areas of Pakistan are loaded from the GADM API. This Spacial Polygon will be useful for plotting later on.

borders  <- raster::getData("GADM",country="Pakistan",level=0)

4. Modelling

4.1 Modelling a single model

Now we can make a random forest model of one of the species to show the process. For this we again use the SSDM package. The function is called modeling and again has many things we can adjust. First we decide which modeling method we want to use. Besides random forest there are many other popular methods like Generalized linear model, Artificial neural network or Support vector machines. The random forest method of SSDM in essence uses the randomForest function from the R-package randomForest.

Parameter you can change in regards of the random forest are the number of trees (default value is 2500) and the Minimum size of terminal node. This refers to the minimal number of data points a node can refer to. Larger numbers lead to a fast computing due to a smaller size of the trees, but possible also to a worst prediction (default Value is 1).

The function does many things which otherwise had to be done manually like creating a presence/absence table or creating random background data points.

For this model we subset the data-set to only calculate the possible habitat of one species (in this case Aglais_caschmirensis)

SDM <- modelling('RF', subset(OCC, OCC$SPECIES == 'Aglais_caschmirensis'), 
                 pred, Xcol = 'LONGITUDE', Ycol = 'LATITUDE')
## Data check ... 
## No presence column, presence-only data set is supposed.
## Pseudo-absence selection will be computed.
##    done. 
## 
## Pseudo absence selection... 
##    random selection 
##    done. 
## 
## Model evaluation...
##    done. 
## 
## Model projection...
##    done. 
## 
## Model axes contribution evaluation...
##    done.

The result is an object from the class RF.SDM which contains many useful information about the model.

str(SDM)
## Formal class 'RF.SDM' [package "SSDM"] with 7 slots
##   ..@ name               : chr "RF.SDM"
##   ..@ projection         :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. .. ..@ file    :Formal class '.RasterFile' [package "raster"] with 13 slots
##   .. .. .. .. ..@ name        : chr ""
##   .. .. .. .. ..@ datanotation: chr "FLT4S"
##   .. .. .. .. ..@ byteorder   : chr "little"
##   .. .. .. .. ..@ nodatavalue : num -Inf
##   .. .. .. .. ..@ NAchanged   : logi FALSE
##   .. .. .. .. ..@ nbands      : int 1
##   .. .. .. .. ..@ bandorder   : chr "BIL"
##   .. .. .. .. ..@ offset      : int 0
##   .. .. .. .. ..@ toptobottom : logi TRUE
##   .. .. .. .. ..@ blockrows   : int 0
##   .. .. .. .. ..@ blockcols   : int 0
##   .. .. .. .. ..@ driver      : chr ""
##   .. .. .. .. ..@ open        : logi FALSE
##   .. .. ..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
##   .. .. .. .. ..@ values    : num [1:3112686] NA NA NA NA NA NA NA NA NA NA ...
##   .. .. .. .. ..@ offset    : num 0
##   .. .. .. .. ..@ gain      : num 1
##   .. .. .. .. ..@ inmemory  : logi TRUE
##   .. .. .. .. ..@ fromdisk  : logi FALSE
##   .. .. .. .. ..@ isfactor  : logi FALSE
##   .. .. .. .. ..@ attributes: list()
##   .. .. .. .. ..@ haveminmax: logi TRUE
##   .. .. .. .. ..@ min       : num 0
##   .. .. .. .. ..@ max       : num 1
##   .. .. .. .. ..@ band      : int 1
##   .. .. .. .. ..@ unit      : chr ""
##   .. .. .. .. ..@ names     : chr "Projection"
##   .. .. ..@ legend  :Formal class '.RasterLegend' [package "raster"] with 5 slots
##   .. .. .. .. ..@ type      : chr(0) 
##   .. .. .. .. ..@ values    : logi(0) 
##   .. .. .. .. ..@ color     : logi(0) 
##   .. .. .. .. ..@ names     : logi(0) 
##   .. .. .. .. ..@ colortable: logi(0) 
##   .. .. ..@ title   : chr(0) 
##   .. .. ..@ extent  :Formal class 'Extent' [package "raster"] with 4 slots
##   .. .. .. .. ..@ xmin: num 60.9
##   .. .. .. .. ..@ xmax: num 77
##   .. .. .. .. ..@ ymin: num 23.7
##   .. .. .. .. ..@ ymax: num 37
##   .. .. ..@ rotated : logi FALSE
##   .. .. ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
##   .. .. .. .. ..@ geotrans: num(0) 
##   .. .. .. .. ..@ transfun:function ()  
##   .. .. ..@ ncols   : int 1943
##   .. .. ..@ nrows   : int 1602
##   .. .. ..@ crs     :Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs"
##   .. .. ..@ history : list()
##   .. .. ..@ z       : list()
##   ..@ binary             :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. .. ..@ file    :Formal class '.RasterFile' [package "raster"] with 13 slots
##   .. .. .. .. ..@ name        : chr ""
##   .. .. .. .. ..@ datanotation: chr "FLT4S"
##   .. .. .. .. ..@ byteorder   : chr "little"
##   .. .. .. .. ..@ nodatavalue : num -Inf
##   .. .. .. .. ..@ NAchanged   : logi FALSE
##   .. .. .. .. ..@ nbands      : int 1
##   .. .. .. .. ..@ bandorder   : chr "BIL"
##   .. .. .. .. ..@ offset      : int 0
##   .. .. .. .. ..@ toptobottom : logi TRUE
##   .. .. .. .. ..@ blockrows   : int 0
##   .. .. .. .. ..@ blockcols   : int 0
##   .. .. .. .. ..@ driver      : chr ""
##   .. .. .. .. ..@ open        : logi FALSE
##   .. .. ..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
##   .. .. .. .. ..@ values    : num [1:3112686] NA NA NA NA NA NA NA NA NA NA ...
##   .. .. .. .. ..@ offset    : num 0
##   .. .. .. .. ..@ gain      : num 1
##   .. .. .. .. ..@ inmemory  : logi TRUE
##   .. .. .. .. ..@ fromdisk  : logi FALSE
##   .. .. .. .. ..@ isfactor  : logi FALSE
##   .. .. .. .. ..@ attributes: list()
##   .. .. .. .. ..@ haveminmax: logi TRUE
##   .. .. .. .. ..@ min       : num 0
##   .. .. .. .. ..@ max       : num 1
##   .. .. .. .. ..@ band      : int 1
##   .. .. .. .. ..@ unit      : chr ""
##   .. .. .. .. ..@ names     : chr "Projection"
##   .. .. ..@ legend  :Formal class '.RasterLegend' [package "raster"] with 5 slots
##   .. .. .. .. ..@ type      : chr(0) 
##   .. .. .. .. ..@ values    : logi(0) 
##   .. .. .. .. ..@ color     : logi(0) 
##   .. .. .. .. ..@ names     : logi(0) 
##   .. .. .. .. ..@ colortable: logi(0) 
##   .. .. ..@ title   : chr(0) 
##   .. .. ..@ extent  :Formal class 'Extent' [package "raster"] with 4 slots
##   .. .. .. .. ..@ xmin: num 60.9
##   .. .. .. .. ..@ xmax: num 77
##   .. .. .. .. ..@ ymin: num 23.7
##   .. .. .. .. ..@ ymax: num 37
##   .. .. ..@ rotated : logi FALSE
##   .. .. ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
##   .. .. .. .. ..@ geotrans: num(0) 
##   .. .. .. .. ..@ transfun:function ()  
##   .. .. ..@ ncols   : int 1943
##   .. .. ..@ nrows   : int 1602
##   .. .. ..@ crs     :Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs"
##   .. .. ..@ history : list()
##   .. .. ..@ z       : list()
##   ..@ evaluation         :'data.frame':  1 obs. of  7 variables:
##   .. ..$ threshold    : num 0.562
##   .. ..$ AUC          : num 0.906
##   .. ..$ omission.rate: num 0.0938
##   .. ..$ sensitivity  : num 0.906
##   .. ..$ specificity  : num 0.906
##   .. ..$ prop.correct : num 0.906
##   .. ..$ Kappa        : num 0.812
##   ..@ variable.importance:'data.frame':  1 obs. of  5 variables:
##   .. ..$ prec    : num 19.2
##   .. ..$ srad    : num 46.9
##   .. ..$ temp_avg: num 8.94
##   .. ..$ vapr    : num 8.86
##   .. ..$ wind    : num 16.1
##   ..@ data               :'data.frame':  106 obs. of  8 variables:
##   .. ..$ X       : num [1:106] 71.1 71.6 71.7 71.7 71.7 ...
##   .. ..$ Y       : num [1:106] 33.8 35.9 35.8 35.9 36 ...
##   .. ..$ Presence: num [1:106] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ prec    : num [1:106] 0.343 0.556 0.269 0.241 0.259 ...
##   .. ..$ srad    : num [1:106] 0.625 0.423 0.459 0.471 0.464 ...
##   .. ..$ temp_avg: num [1:106] 0.227 -0.778 -0.197 -0.283 -0.46 ...
##   .. ..$ vapr    : num [1:106] 0.374 0.0611 0.1527 0.1603 0.1221 ...
##   .. ..$ wind    : num [1:106] 0.169 0.217 0.181 0.205 0.253 ...
##   ..@ parameters         :'data.frame':  1 obs. of  6 variables:
##   .. ..$ data       : chr "presence-only data set"
##   .. ..$ metric     : chr "SES"
##   .. ..$ PA         : logi TRUE
##   .. ..$ cv         : chr "holdout"
##   .. ..$ cv.param   : chr "|0.7|2"
##   .. ..$ axes.metric: chr "Pearson"

The part we mainly care about is the species habitat map which contains the possible spreading of the species. The map is stored in and contains a raster-layer with the possibility of the species occurring with values from 0 to 1.

plot(SDM@projection)
plot(borders, add=TRUE)

For assessing the performance of the predication we can use the statistical metrics provided in the tab of the model.

SDM@evaluation
##   threshold     AUC omission.rate sensitivity specificity prop.correct  Kappa
## 1   0.56225 0.90625       0.09375     0.90625     0.90625      0.90625 0.8125

A useful metric to look here is the AUC (area under curve) which describes the performance of the model in regards of prediction. Its composed from the False Positive Rate (specificity) on the x-axis and the True Positive Rate (Sensitivity) on the y-axis. A perfect prediction with only True Positives would yield a Value of 1, a purely random selection with a equal number of true and false positive would yield a 0.5 score. A value above 0.9 is considered a excellent model performance. Another value to look at is the Kappa. Here values over 0.8 are considered very good.


Our model can be considered very good with a AUC of 0.9375 and a Kappa of 0.875.

The last thing we take a look at is the importance of our predictor variables. The results are displayed in percentage of the overall prediction.

SDM@variable.importance
##                     prec     srad temp_avg     vapr     wind
## Axes.evaluation 19.22326 46.86036 8.942649 8.859555 16.11418

In case of our model the solar radiation has the biggest importance followed by precipitation and wind. Least important was vapor pressure.

4.2 Modelling multible models

To create a species richness map of all the butterfly species in Pakistan we basically do the same things as above for all the species. So we use the modeling() function for all the species one after the other. To do this we use a for-Loop.

Seeting up aditional data

Before we can loop we have to make a new variable with all the unique names of the species. Thanks to the SSDM package the species column is already in the factor format so we simply can use levels() to get the names of all species.

species_names<-levels(OCC$SPECIES)



For the species maps we need to convert the species habitat maps in a binary form to add up the occurrences. To do this we use a function from Cecina Marrow. The function uses the MTP (minimum training presence) to calculate the threshold for every single species. This methods assumes, that the known presence with the least suitable habitat is the habitat border for this species.

For more information you can visit her website. Click here

sdm_threshold <- function(sdm, occs, type = "mtp", binary = FALSE){
  occPredVals <- raster::extract(sdm, occs)
  if(type == "mtp"){
    thresh <- min(na.omit(occPredVals))
  } else if(type == "p10"){
    if(length(occPredVals) < 10){
      p10 <- floor(length(occPredVals) * 0.9)
    } else {
      p10 <- ceiling(length(occPredVals) * 0.9)
    }
    thresh <- rev(sort(occPredVals))[p10]
  }
  sdm_thresh <- sdm
  sdm_thresh[sdm_thresh < thresh] <- NA
  if(binary){
    sdm_thresh[sdm_thresh >= thresh] <- 1
  }
  return(sdm_thresh)
}

Finally we create a empty list to save our raster layer into during the looping.

Binary=list()



4.3 Looping the models

Now we loop through the different models. In the loop we first make a model with a single species. Then we use the prediction map and a subset of the dataframe with only our current species to calculate the binary map for only this species. At the end the raster is saved to our list. This process repeats for approximately 320 times, which can take a while depending on your computer. Note that the script wont actually calculate here because of great amount of Time it takes.

for (i in species_names) {
  
  data <- modelling('RF', subset(OCC, OCC$SPECIES == i), pred, Xcol = 'LONGITUDE', Ycol = 'LATITUDE')  
  
  sloth_sdm<-data@projection
  
  sloth_occs<-subset(oscc, oscc$SPECIES == i)
  
  sloth_mtp_bin <- sdm_threshold(sloth_sdm, sloth_occs[,2:3], "mtp", binary = TRUE)
  
  Binary[i]<-sloth_mtp_bin
  
}

Now we stack the raster layer in the list to a new raster stack.

Binary_stack<-stack(Binary)

Finally we calculate the species richness map by summing up all binary raster layer into one layer. The final resolution of the map is equal to the spacial resolution of the raster layers. When you don’t have a very powerful computer it is possible to lower the resolution to increase performance.

species_richness<-stackApply(Binary_stack, indices=c(1), fun=sum, na.rm=TRUE) 

Now we can take a look at our final map.

plot(species_richness)
plot(borders, add=TRUE)

We can see that the species richness is greatest in the very north of Pakistan, with considerably lower numbers in the rest of the country. When we compare this map to our prediction maps from the start we can see that the butterfly prefer a not that hot climate with higher Precipitation and lower sun radiation compared to the south of Pakistan.

5. Additional information

5.1 Low species count

The count of species in our final map is rather low in comparison to the 400+ species that are in the dataset. The reason behind this is that the function to load in the occurrence data strips many species because they have to few points. Without enough data points a meaningful prediction isn’t possible.

5.2 Alternative calculation methods

There is another method inside the SSDM-Package for calculating the species richness map. The function is called stack_modelling(). It functions similar to the modelling() function but works with multiple species and prediction methods. The resulting object contains normal and already binary maps with calculated thresholds of every single species and model performance metrics. Also the function directly calculates a species richness map which can be loaded with the command. The big drawback and the reason this function wasn’t used in the tutorial besides it’s clear advantages is the enormous demands on the computers hardware. Especially in our case with the very large dataset and over 400 different species. The problem is that the functions will load all the models into RAM until the calculation is finished. This can lead to in our case more than 60 Gb of RAM allocation which is too much for the majority of computers. / With the help of a server workstation or maybe cloud computing with allot of RAM this method could be viable for the dataset but for now it can’t be endorsed for home users.

7. Session Info

This tab fulfills the task of showing the installed packages and the computer and software environment for better reproducibility and troubleshooting.

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rgdal_1.5-23  raster_3.4-10 sp_1.4-5      SSDM_0.2.8   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.3.1           maps_3.3.0           jsonlite_1.7.2      
##  [4] splines_4.0.3        dotCall64_1.0-0      bslib_0.2.4         
##  [7] Formula_1.2-4        shiny_1.6.0          assertthat_0.2.1    
## [10] highr_0.9            yaml_2.2.1           pillar_1.6.0        
## [13] lattice_0.20-41      glue_1.4.2           spThin_0.2.0        
## [16] digest_0.6.27        promises_1.2.0.1     randomForest_4.6-14 
## [19] colorspace_2.0-0     gbm_2.1.8            htmltools_0.5.1.1   
## [22] httpuv_1.5.5         Matrix_1.3-2         plyr_1.8.6          
## [25] pkgconfig_2.0.3      earth_5.3.0          bookdown_0.22       
## [28] purrr_0.3.4          xtable_1.8-4         scales_1.1.1        
## [31] later_1.2.0          TeachingDemos_2.12   tibble_3.1.1        
## [34] mgcv_1.8-33          generics_0.1.0       ggplot2_3.3.3       
## [37] ellipsis_0.3.2       nnet_7.3-15          survival_3.2-7      
## [40] magrittr_2.0.1       crayon_1.4.1         mime_0.10           
## [43] evaluate_0.14        fansi_0.4.2          nlme_3.1-151        
## [46] class_7.3-18         shinydashboard_0.7.1 tools_4.0.3         
## [49] dismo_1.3-3          lifecycle_1.0.0      stringr_1.4.0       
## [52] munsell_0.5.0        plotrix_3.8-1        compiler_4.0.3      
## [55] jquerylib_0.1.3      e1071_1.7-4          rlang_0.4.10        
## [58] plotmo_3.6.0         grid_4.0.3           spam_2.6-0          
## [61] rmarkdown_2.7        gtable_0.3.0         codetools_0.2-18    
## [64] DBI_1.1.1            reshape2_1.4.4       R6_2.5.0            
## [67] knitr_1.33           dplyr_1.0.5          fastmap_1.1.0       
## [70] utf8_1.2.1           poibin_1.5           stringi_1.5.3       
## [73] rmdformats_1.0.2     Rcpp_1.0.6           fields_11.6         
## [76] vctrs_0.3.7          rpart_4.1-15         tidyselect_1.1.0    
## [79] xfun_0.22