UNBIASED CONDITIONAL INFERENCE FOREST (cforest)

1 Introduction

The following tutorial is part of the seminar “Species Distribution Modeling”. The aim of this tutorial is to introduce You to the cforest method by modeling a species distribution of butterfly species in Pakistan and creating a species richness map from it.

Figure 1: Aglais cashmirensis (license: CC BY-SA 4.0)

2 What is cforest ?

Cforest, like Random Forest, is a statistical method for performing regression analyses. It uses an algorithm to create probability trees based on several independent variables (HOTHORN et al. 2006, p. 1). In contrast to linear regressions, random forest and cforest have the advantage that they can process large amounts of data with many complex variables in a relatively short time (STROBEL 2009a, p. 4). This fits well for our project to predict potential butterfly occurrences with hundreds of butterfly sites and a lot of biovariates like annual rainfall or temperature data.

2.1 What is a decision tree?

As you see below, a decision tree is very easy to visualize. It consists of different internal nodes that divide sampled data into two categories by an independent variable. Through this procedure, the impurity of the data will be reduced. At the end of each branch are the leaf nodes which contain the splitted data. If one leaf node contains just one category of the data this leaf node is pure and the impurity value is zero (STROBEL et al. 2009a, p. 9). If one leaf node has more than one category, the impurity rate is higher (ibid.). However, the category with the most values will win the decision when the data of the leaf node is aggregated (BREIMAN 2001, p. 6). Below you can see a little example of a decision tree with the important vocabulary’s internal node, branch, variable and leaf node.

Figure 2: Example of a Decision Tree

3 From Random Forest to cforest

Cforest is a further development of Random Forest and works only slightly different. To understand cforest, it is therefore useful to explain Random Forest first.

3.1 How does Random Forest work

Random Forest is characterized by the random selection of samples and variables (STROBEL et al. 2009a, p. 16). From a sample (for example our butterfly data) subsets are drawn at random, from which decision trees are then formed with randomly selected variables. At the end of the decision trees, i.e. in the leaf nodes, the individual observations can be found in classes. Thereby, the purity of the data summarized in classes was significantly increased by splitting in the nodes before. At the end, the observations of the classes are aggregated to one result. The result, which occurs in the most classes, wins (BREIMAN 2001, S. 1). The repeated dragging and dropping of the subsets from the data is called bootstrapping. Together with aggregating the data at the end of the decision trees, this process is also called bagging (HOTHORN et al. 2004, p. 78). Such a procedure results in good stability of the model and is better than just one decision tree (ibid.). This is because within individual decision trees there will certainly be individual errors. But if thousand such random decision trees are formed several times and the results are aggregated together, then the prediction will be very good on average. If you want do dive deeper in the Theory of Random Forests, we can recommend you the student tutorial from Mandy Gimpel which you can find here: Using Random Forest in R for Butterfly Fauna Distribution in Pakistan (geomoer.github.io).

3.2 Variable selection Bias in Random Forest

After explaining Random Forest to you, we can move on to the further development of this model: to cforest. One problem with Random Forest’s decision trees is that they estimate certain variables to be more important than they actually are (STROBEL et al. 2009b, p. 1). These are especially variables with many categories or continuous data but also variables which have many missing measurement data (STROBEL et al 2009a, p. 30). This creates a bias that affects the importance of the variables and impacts the validity of the model (ibid.). Such a bias arises especially when many heterogeneous variables with different numbers of categories are used to create the decision trees (HOTHORN et al. 2006, p. 2). The cause of the incorrect estimation are calculations of the Gini importance or the permutation importance measurement of a variable. These two tests show how well a variable can clean up the data in a node by splitting the data (STROBEL et al. 2009a, p. 8). To put it in a nutshell, when the variable importance measure is biased towards variables with a lot of categories, these variables will be preferred to build the decision tree so the output of the model will be biased in an incorrect direction (STROBEL et al. 2007, p. 2). Since we use data with a high range of continuous data, i.e. the annual precipitation in the variable bio 12 (WORLDCLIM 2022) and data with fewer categories, i.e. the min temperature of the coldest month in the variable bio 06 in predicting butterfly occurrence, it makes sense to avoid such bias in the importance of variables with cforest. One more note: there are even more studies that are trying to solve the problem of biased variable selection using a corrected impurity measure (AIR) (NEMBRINI et al. 2018, p. 3711f.).

3.3 Solving the variable selection bias problem with cforest

But how does cforest manage to circumvent this bias within the independent variables? This is done by measuring the significance of each variable selected for a node in the decision tree. To be absolutely clear, cforest creates random forests but the Gini importance and the permutation importance measurement for the selection of variables in the trees are taken out of the race (STROBEL et al. 2007, p. 3). Instead of this, the variable that has the highest significance, i.e. explains the largest part of the independent variable, is used as the first splitting variable within a node in the decision tree (HOTHORN et al., 2006, p. 2). All other variables that are also significant and thus explain a certain proportion of our dependent variable can also be used in the decision tree. However, if the significance test of a variable turns out to be negative, i.e. a p-test for example above 0.05%, then this tree is terminated (ibid., p. 4). This is to circumvent the bias of variable weighting, to achieve a better predictive accuracy of the model and to be able to analyse data with heterogeneous variables. We will call the trees that are created with the cforest function in R conditional inference trees (STROBEL et al. 2007, p. 5). Below is an example of one of our decision trees created with c-forest. The p-value is below 0.05 in all branches, which is a good result and explains why the variables are used to split the data into smaller categories.

Figure 3: Cforest tree example

3.4 Even more bias problems with variables that have many categories?

Unfortunately, there is also a small problem with the selection of variables in cforest. If bootstrapping, i.e. the repeated dragging and replacement from a sample, is performed with heterogeneous variables, then variables with many categories are also preferred in cforest (STROBEL et al. 2007, p. 17). This problem can be solved if the drawing of the variables is done without replacement. Then, no variables with many categories are used more often than other variables for creating the decision trees (ibid.).

3.5 Overfitting as a second problem of decision trees

Another problem with using decision trees is overfitting (HOTHORN et al. 2006, p. 2). Overfitting occurs when the model learns the structures inherent in the learning sample (STROBEL et al 2009a, p. 9). However, these structures are not found in reality, but are only created by the arbitrary selection of the sample. To prevent this adaptation of the model to the learning data, the length of the decision trees is limited. They can simply be pruned from a certain number of nodes or they can be interrupted by a statistical criterion. In cforest, for example, the continuation of the decision tree can be interrupted as soon as a selected significance level is not reached (HOTHORN et al. 2006, p. 7). This also addresses the question of how long decision trees should grow. STROBEL et al. also points out that it can be useful to limit the length of the decision trees to avoid overfitting (STROBEL et al. 2009a, p. 10). Especially with a large number of data, the predictive accuracy of the model seems to increase when using many but small trees. It is different for small samples. In this case, the use of large trees makes sense (ibid., p. 33). Since we often have a small sample of butterfly species, we should apply this advice and use large decision trees.

4 A short overview over the bioclim data

For our species distribution model, we will use the free available Bioclimatic variables from WorldClim dataset for Pakistan. This is part of the geodata package. They contain 19 different variables which are generated from monthly temperature and rainfall data (WorldClim 2022). When you have a closer look, you can see that they measure annual trends, for example the mean annual temperature, seasonality, represented in annual data and extreme environmental factors (ibid.). An example for an extreme factor is the precipitation of the driest quarter and could be a limiting environmental factor for the occurrence of our butterflies in Pakistan. Below you can see all 19 variables which are part of the WorldClim biodata variables:

BIO1 = Annual Mean Temperature

BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))

BIO3 = Isothermality (BIO2/BIO7) (×100)

BIO4 = Temperature Seasonality (standard deviation ×100)

BIO5 = Max Temperature of Warmest Month

BIO6 = Min Temperature of Coldest Month

BIO7 = Temperature Annual Range (BIO5-BIO6)

BIO8 = Mean Temperature of Wettest Quarter

BIO9 = Mean Temperature of Driest Quarter

BIO10 = Mean Temperature of Warmest Quarter

BIO11 = Mean Temperature of Coldest Quarter

BIO12 = Annual Precipitation

BIO13 = Precipitation of Wettest Month

BIO14 = Precipitation of Driest Month

BIO15 = Precipitation Seasonality (Coefficient of Variation)

BIO16 = Precipitation of Wettest Quarter

BIO17 = Precipitation of Driest Quarter

BIO18 = Precipitation of Warmest Quarter

BIO19 = Precipitation of Coldest Quarter

5 Modeling Workflow for one unique species

5.1 Loading the packages we need and setting the working directory

First we clean the environment and use the garbage collector to free some memory

rm(list=ls())# removes everything in the environment
gc()# garbage collector

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 514278 27.5    1147244 61.3   644242 34.5
## Vcells 916801  7.0    8388608 64.0  1635552 12.5

Used packages

library(caret)
library(sf)
library(geodata)
library(dismo)
library(dplyr)
library(mapview)
library(party)
library(terra)
library(Metrics)
library(ecospat)
library(raster)
library(leaflet)
library(maptiles)
library(ggplot2)
library(gt)
library(leaflet.extras2)

Working directory

wd = "C:\\UNI\\SoSe_23\\SDM\\workflowSDM"# the path varies on your structure
setwd(wd)

5.2 Loading our data

We start by loading the species file “PakistanLadakh.csv” provided to us in the course. It is saved in CSV format and contains individual butterfly species with presence points and xy coordinates collected in Pakistan.

data <- read.csv(file.path(wd, "data", "PakistanLadakh.csv"))# we are using the standard read.csv function to read in the data

And change the column names

colnames(data) <- c("species","x","y")

We can now count the number of occurrences per species to see a bit of the data structure.

spec_names <- unique(data$species)
cat("The Dataset contains", format(length(spec_names)), "different species 
with overall", format(length(data$species)), "presence entries, 
and an average of", format(round(length(data$species)/length(spec_names),0)), "presence entries per species.")

## The Dataset contains 429 different species 
## with overall 5870 presence entries, 
## and an average of 14 presence entries per species.

After that we load the environmental variables for Pakistan. For this task we just used the bioclimate ‘bio’ data provided by WorldClim climate data in the geodata package. You can also use a lot more variables as predictors as you can see on the github page “rspatial / geodata”.

# get the environmental layers
subfolderPath <- file.path(wd, "data")
raster <- geodata::worldclim_country(country="PAK", res=10, var="bio", path=subfolderPath, version = "2.1")
names(raster) <- substr(names(raster), 11, 20)

# get the gadm (Database of Global Administrative Areas) border of Pakistan also from the geodata package
gadmPak <- geodata::gadm(country="Pakistan", level=0, path=subfolderPath)

# and mask the raster layer to the border of Pakistan
raster <- terra::mask(raster, gadmPak)

# save the raster for later use
terra::writeRaster(raster, file.path(wd, "data", "bioclimPak.tif"), overwrite=T)

5.3 Data preparation

For testing the model we selecting just one single species from the provided data. In this case Aglais caschmirensis.

Aglais_caschmirensis <- dplyr::filter(data, species=="Aglais_caschmirensis")# using the filter function from dplyr package

# We are transforming the data.frame into an sf object for later using some functions in the terra package. Therefore we use the sf package
# You can ckeck the crs (coordinate reference system) by using the crs function in the terra package (example: terra::crs(raster))
Aglais_caschmirensis <- sf::st_as_sf(Aglais_caschmirensis, coords=c("x", "y"), remove=F, crs=sf::st_crs("epsg:4326"))

Next we create some background absence points, for this example 1000.

n = 1000 # we use randomly generated 1000 sample points, You can also use more or less points
# for generating we use the dismo package and the randomPoints function
bg <- sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(raster), 
                                                                   n=n)), 
                                                                   crs=terra::crs(raster), 
                                                                   coords=c("x","y"), remove=F)

# after generating we extract the environmental information for background data from our SpatRaster raster
bg_extr <- terra::extract(raster, bg)
bg <- cbind(bg,bg_extr);rm(bg_extr)

# and binding the extracted background absence points with the extracted presence points from the butterfly species to one data.frame
Aglais_caschmirensis_extr <- terra::extract(raster, Aglais_caschmirensis)
Aglais_caschmirensis <- cbind(Aglais_caschmirensis, Aglais_caschmirensis_extr);rm(Aglais_caschmirensis_extr)

We can now plot the data to look were the butterflies are in presence. Therefore we use the leaflet package.

leaflet(data = Aglais_caschmirensis) %>% addTiles() %>% 
  setView(lat = 33.7, lng = 73.1, zoom = 4) %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Esri World Imagery") %>%
  addAwesomeMarkers(~x, ~y, popup = ~ID, label = ~ID, group = "presence points of Aglais caschmirensis")  %>%
  addPolygons(data = gadmPak, col="black", fillColor="Transparent",
              opacity = 1, group = "Border of Pakistan") %>%
  addLayersControl(
    baseGroups = "Esri World Imagery",
    overlayGroups = c("presence points of Aglais caschmirensis", "Border of Pakistan"),
    options = layersControlOptions(collapsed = F))

And now we are splitting and saving the data of the selected butterfly species into a training and testing set by using the dplyr package. We decided for 80 percent training and 20 percent testing. If You have more available presence data points You can also go for 75/25 or 70/30 splitting.

`%not_in%`<- Negate(`%in%`)# formula to split the data

set.seed(2023) #set a seed for reproducibility

testData=dplyr::slice_sample(Aglais_caschmirensis, prop=.20)
trainData=dplyr::filter(Aglais_caschmirensis, ID %not_in% testData$ID )

# We save the test, train and background data by using the sf package
sf::write_sf(testData, file.path(wd, "data", "Aglais_caschmirensis_testData.gpkg"))
sf::write_sf(trainData, file.path(wd, "data","Aglais_caschmirensis_trainData.gpkg"))
sf::write_sf(bg, file.path(wd, "data","background.gpkg"))

Now we load the saved training and background data created in the step before to make one data.frame with presence and absence points for the modeling.

# read the presence only data
po <- sf::read_sf(file.path(wd, "data", "Aglais_caschmirensis_trainData.gpkg"))
po$occ=1
# read the absence only data
bg <- sf::read_sf(file.path(wd, "data","background.gpkg"))
bg$occ=0
# we can delete the species column because we have just one unique species
po <- po %>% select(-species)
#str(po) you can have a look at the structure

We bind the presence and absence data to one data.frame and can delete the geom column.

data <- rbind(po,bg)%>%as.data.frame()%>%dplyr::select(-geom)

Now we can plot a sample table with the first 10 rows of our data to look if everything is okay up to here.

# for this plot we use the gt package
head(data, 10) %>% 
  gt() %>%
tab_header(title = md("*Aglais caschmirensis* data summary")) %>%
 fmt_number(
    columns = c(bio_1:bio_19),
    decimals = 3# for a better overview we set the decimal values to 3
  ) %>%
tab_style(
style = list(cell_fill(color = "lightblue"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(x, y))) %>%
tab_style(
style = list(cell_fill(color = "lightyellow"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(bio_1:bio_19)) 
)

x	y	ID	bio_1	bio_2	bio_3	bio_4	bio_5	bio_6	bio_7	bio_8	bio_9	bio_10	bio_11	bio_12	bio_13	bio_14	bio_15	bio_16	bio_17	bio_18	bio_19	occ
Aglais caschmirensis data summary
71.62838	35.92580	2	−3.513	9.158	27.837	851.534	13.300	−19.600	32.900	−10.417	−1.783	6.917	−13.650	707.000	122.000	18.000	57.152	319.000	82.000	107.000	205.000	1
71.69917	35.81516	3	8.062	10.942	32.759	843.411	25.600	−7.800	33.400	6.500	9.583	18.433	−2.267	426.000	80.000	14.000	64.544	213.000	49.000	63.000	101.000	1
71.70334	35.87940	4	7.496	10.942	31.351	901.777	26.000	−8.900	34.900	5.983	9.067	18.550	−3.617	386.000	74.000	13.000	66.715	197.000	45.000	53.000	91.000	1
71.70810	35.98351	5	4.304	10.292	29.405	911.104	22.900	−12.100	35.000	2.717	5.883	15.467	−6.850	395.000	75.000	11.000	64.160	196.000	45.000	57.000	97.000	1
71.77320	36.15080	6	−5.754	9.442	28.786	837.166	10.700	−22.100	32.800	−12.633	−3.950	4.433	−15.667	715.000	116.000	20.000	52.535	308.000	91.000	110.000	210.000	1
71.78736	35.36315	7	8.238	10.775	31.973	872.213	25.700	−8.000	33.700	6.883	9.850	18.850	−2.633	610.000	107.000	18.000	57.613	280.000	70.000	116.000	144.000	1
71.91097	36.11939	9	−1.250	10.333	30.039	867.614	16.400	−18.000	34.400	−3.417	0.717	9.317	−11.617	549.000	95.000	16.000	55.037	246.000	68.000	87.000	148.000	1
72.45343	34.91264	10	17.717	13.400	39.067	802.562	34.900	0.600	34.300	26.317	13.600	27.450	7.533	895.000	139.000	24.000	51.952	321.000	101.000	306.000	203.000	1
72.49574	36.10739	11	2.808	11.433	31.937	901.410	21.100	−14.700	35.800	1.200	4.617	13.717	−8.300	339.000	55.000	10.000	49.481	149.000	42.000	73.000	75.000	1
72.49646	36.04957	12	4.517	11.567	31.177	954.195	23.800	−13.300	37.100	2.850	6.367	16.083	−7.233	317.000	51.000	9.000	49.974	140.000	39.000	71.000	67.000	1

5.4 Let’s run the first model

In the next task we build the model with the caret package and cforest as the method. We selected for the controls cforest_unbiased with an mtry of 3 and ntree of 50, the default is 5 and 500, but this needs a lot of computation time and computer resources. After train the model we can use the varImp function to have a look at the importance of the variables.

model1 <- caret::train(occ ~ bio_1 + bio_2 + bio_3 + bio_4 + bio_5 + bio_6 + bio_7 + bio_8 + bio_9 + bio_10 + 
                             bio_11 + bio_12 + bio_13 + bio_14 + bio_15 + bio_16 + bio_17 + bio_18 + bio_19,
                       data = data,
                       method = "cforest", 
                       controls = cforest_unbiased(mtry = 3, ntree = 50),
                       na.action = na.exclude)

terra::saveRDS(model1, file.path(wd, "models", file = "model1.rds"))

In the variable importance plot You can see that bio_17 (Precipitation of Driest Quarter) and bio_1 (Annual Mean Temperature) has the highest importance on the model1 and bio_15 (Precipitation Seasonality) plays almost no role for the training.

# for plotting the variable importance we use the ggplot2 package
ggplot2::ggplot(varImp(model1))+ 
  ggtitle("Variable importance of model1")

Now it’s time for predicting and saving the result as a SpatRaster. For the prediction and saving we use the terra package.

predModel1 <- terra::predict(object = raster, # our raster for the prediction
                              model = model1, # the model we trained in the step before
                              OBB=TRUE, # default for cforest
                              na.rm = T, # cforest can not handle NAs
                              cores=4, # for faster computation time You can set here the number of the cores Your computer has installed
                              cpkgs="party") # to use the cores You need link the package Your used method belongs to 
                                             # for more information have a look at the terra package documentation 

terra::writeRaster(predModel1, file.path(wd, "predictions", "prediction_Aglais_caschmirensis.tif"), overwrite=TRUE)# save the prediction raster

We plot the prediction raster with the mapview package. It shows the habitat suitability of the Aglais caschmirensis butterfly species in comparison to the occurance points.

rastPred <- raster::raster(file.path(wd, "predictions","prediction_Aglais_caschmirensis.tif"))
mapview(rastPred, zcol = "values", layer.name = "Habitat suitability of Aglais caschmirensis", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery") + 
mapview(po, layer.name = "Presence points")

5.5 Results

For evaluation we use the Boyce Index from the ecospat Package and with the Metrics package we calculate the AUC. The area under the curve (AUC) is an accuracy measure which explains the precision of the model. A value of 1 would be a perfect fitting of the model. Instead of this, the value of 0.5 indicate that the model is not better than random (Janitza et al. 2013, p. 3ff.). The Boyce index indicates the extent to which model predictions vary from random predictions. Unlike the AUC, the Boyce index only requires presence points. The value of the Boyce index ranges from 0 to +1, where a positive value indicates high prediction accuracy, and a value close to zero suggests that the prediction does not differ from a random guess (Bellard et al. 2013, p. 3742).

# load your testdata
predRast <- terra::rast(file.path(wd, "predictions", "prediction_Aglais_caschmirensis.tif"))
testData <- sf::read_sf(file.path(wd, "data", "Aglais_caschmirensis_testData.gpkg"))

# calculate and plot the boyce-index:
boyceIndex <- ecospat::ecospat.boyce(fit=raster(predRast), obs=testData)

boyceIndex <- boyceIndex$cor

#extract the values of the testdata from the prediction raster for AUC and MAE
extrTest <- terra::extract(predRast, testData)
colnames(extrTest) <- c( "ID"  , "predcicted")
extrTest$observed <- 1

# load background data and extract
bg <- sf::read_sf(file.path(wd, "data", "background.gpkg"))
extrBg <- terra::extract(predRast, bg)
colnames(extrBg) <- c( "ID"  , "predcicted")
extrBg$observed <- 0

# bind both dataframes together:
testData_boyce <- rbind(extrTest, extrBg)
#rm(extrTest, extrBg,bg)

# calculate AUC and MAE
AUC <- Metrics::auc(actual = testData_boyce$observed, predicted = testData_boyce$predcicted)

print(AUC)

## [1] 0.9797

print(boyceIndex)

## [1] 0.687

6 Final model and species richness map

After testing the model on one single species we can go for a larger prediction using more different species and a for loop for modeling.

We load all the needed data again and use a filter for the amount of different species, for demonstrating we use a threshold of 50 presence point per species. If You have a powerful computer You can use lower thresholds but in that case the computation time is longer.

rm(list=ls()) # first we clean again the workspace
gc()

##           used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 5691539  304   10985567  586.7   7968538  425.6
## Vcells 9429096   72  279432890 2132.0 436477280 3330.1

Setting the working directory

wd = "C:\\UNI\\SoSe_23\\SDM\\workflowSDM"
setwd(wd)

Load the data

data <- read.csv(file.path(wd, "data", "PakistanLadakh.csv"))
colnames(data) <- c("species","x","y")

Filter the data

speciesButterflies <- table(data$species)

x <- 50  # Threshold can vary, depends on the power of the used pc hardware
filteredSpecies <- names(speciesButterflies[speciesButterflies >= x | speciesButterflies == 0])

data <- data[data$species %in% filteredSpecies, ]

# We are transforming the data.frame again into an sf object
data <- sf::st_as_sf(data, coords=c("x", "y"), remove=F, crs=sf::st_crs("epsg:4326"))

Loading the environmental data raster from the first modeling

raster <- terra::rast(file.path(wd, "data", "bioclimPak.tif"))

Prepairing everything for the loop function

To find a threshold for the prediction we use a function developed by Cecina Babich Morrow, for more information you can click on the following link: Thresholding species distribution models

# function of Cecina Babich Morrow
sdm_threshold <- function(sdm, occs, type = "mtp", binary = FALSE){
  occPredVals <- raster::extract(sdm, occs)
  if(type == "mtp"){
    thresh <- min(na.omit(occPredVals))
  } else if(type == "p10"){
    if(length(occPredVals) < 10){
      p10 <- floor(length(occPredVals) * 0.9)
    } else {
      p10 <- ceiling(length(occPredVals) * 0.9)
    }
    thresh <- rev(sort(occPredVals))[p10]
  }
  sdm_thresh <- sdm
  sdm_thresh[sdm_thresh < thresh] <- NA
  if(binary){
    sdm_thresh[sdm_thresh >= thresh] <- 1
  }
  return(sdm_thresh)
}

After that we set some parameters for the loop and create some lists to store the results of each model and prediction.

numSamples <- 100# we set the sample for absence points to 100 for faster computation time

`%not_in%`<- Negate(`%in%`)# part of the data splitting function
# lists to store the results while the loop is running
models <- list()# model of every unique species
predictions <- list()# predictions of every unique species
boyceIndices <- list()# boyce Indices
aucValues <- list()# auc values
maeValues <- list()# mean absolute error for evaluation
sdmMTP <- list()# the predicted rasters with the threshold of Cecina Babich Morrow

Now it’s time for the loop.

for (i in filteredSpecies) {
  
  dataSubset <- subset(data, species %in% i)# subset for every unique species in the data
  
  dataSpeciesExtr <- terra::extract(raster, dataSubset)
  dataSpecies <- cbind(dataSubset, dataSpeciesExtr)
  dataSpecies$occ <- 1# setting the occurance data to 1
  
  # splitting the data in training and testing
  set.seed(2023)
  testData <- dplyr::slice_sample(dataSpecies, prop=.25)# 25% for testing data, rest for training
  testData <- na.omit(testData)
  trainDataMTP <- dplyr::filter(dataSpecies, ID %not_in% testData$ID)
  trainData <- trainDataMTP %>% select(-species)
  
  # making absence points
  absenceSamples <- sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(raster), 
                                                                   n=numSamples)), # number of samples 
                                                                   crs=terra::crs(raster), 
                                                                   coords=c("x","y"), remove=F)
  
  absenceValuesExtr <- terra::extract(raster, absenceSamples)
  absenceValues <- cbind(absenceSamples, absenceValuesExtr)
  absenceValues$occ <- 0# setting the absence data to 0 
  
  trainData <- rbind(trainData, absenceValues) %>% as.data.frame ()%>% dplyr::select(-geometry)
  
  model <- caret::train(occ ~ bio_1 + bio_2 + bio_3 + bio_4 + bio_5 + bio_6 + bio_7 + bio_8 + bio_9 + bio_10
                              + bio_11 + bio_12 + bio_13 + bio_14 + bio_15 + bio_16 + bio_17 + bio_18 + bio_19,
                        data = trainData,
                        method = "cforest", 
                        controls = cforest_unbiased(mtry = 1, ntree = 50),
                        na.action = na.exclude)
  
  
  models[[i]] <- model
  modelFile <- paste0(i, "_model_1_50.rds")
  terra::saveRDS(model, file.path(wd, "models", file = modelFile))
  
  predict <- terra::predict(object = raster, 
                            model = model, 
                            OBB = TRUE, 
                            na.rm = TRUE, 
                            cores = 4, 
                            cpkgs = "party", 
                            se.fit=TRUE)
  
  predictions[[i]] <- predict
  
  predictionFile <- paste0(i, "_predict_1_50.tif")
  terra::writeRaster(predict, file.path(wd, "predictions", file = predictionFile), overwrite=TRUE)
  
  # calculating the boyce index
  boyceIndex <- ecospat::ecospat.boyce(fit = raster(predict), obs = testData)
  boyceIndices[[i]] <- boyceIndex$cor
  
  # calculate the AUC and MAE
  extrTest <- terra::extract(predict, testData)
  colnames(extrTest) <- c("ID", "predicted")
  extrTest$observed <- 1
  
  extrAv <- terra::extract(predict, absenceValues)
  colnames(extrAv) <- c("ID", "predicted")
  extrAv$observed <- 0
  extrAv <- na.omit(extrAv)
  
  testData <- rbind(extrTest, extrAv)
  
  maeValue <- Metrics::mae(actual = testData$observed, predicted = testData$predicted)
  maeValues[[i]] <- maeValue
  aucValue <- Metrics::auc(actual = testData$observed, predicted = testData$predicted)
  aucValues[[i]] <- aucValue
  
  # predictions with the Morrow threshold
  sdm <- sdm_threshold(raster(predict), trainDataMTP[,2:3], "mtp", binary = TRUE)
  sdmMTP[i] <- sdm
  predictThresFile <- paste0(i, "_predictThres_1_50.tif")
  raster::writeRaster(sdm, file.path(wd, "predictions", file = predictThresFile), overwrite=TRUE)
}

For Evaluation we use the Boyce Index and the AUC again and additional the MAE (Mean Absolute Error). Therefore we make a summary data.frame and store every value from each prediction in it. The summary shows a high variance in the Boyce Index if You look at the the table. This might be the effect of a poor number of presence points collected in the field work. So the models are strictly limited to the amount of presence points.

summary <- data.frame(species = filteredSpecies,
                      BoyceIndex = unlist(boyceIndices),
                      AUC = unlist(aucValues),
                      MAE = unlist(maeValues))
           
saveRDS(summary, file = "predictions/summary_1_50.Rds") # Save data.frame

First load the saved summary data.

summary <- readRDS(file = "C:\\UNI\\SoSe_23\\SDM\\workflowSDM\\predictions\\summary_1_50.Rds")

Now we can plot the evaluation summary by using the gt package.

# for this plot we use the gt package
summary %>% 
  gt() %>%
tab_header(title = md("Summary of the evaluation indices")) %>%
 fmt_number(
    columns = c(BoyceIndex, AUC, MAE),
    decimals = 3# for a better overview we set the decimal values to 3
  ) %>%
tab_style(
style = list(cell_fill(color = "lightblue"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(BoyceIndex, AUC, MAE))) %>%
tab_style(
style = list(cell_fill(color = "lightyellow"),
cell_text(weight = "bold", style = "italic")),
locations = cells_body(columns = species) 
)

species	BoyceIndex	AUC	MAE
Summary of the evaluation indices
Aglais_caschmirensis	0.467	0.944	0.150
Anaphaeis_aurota	0.872	0.898	0.231
Colias_erate	0.543	0.896	0.236
Colias_fieldii	0.835	0.908	0.225
Danaus_chrysippus	0.919	0.832	0.253
Eurema_hecabe	0.749	0.845	0.197
Gonepteryx_nepalensis	0.827	0.924	0.192
Junonia_orithya	0.775	0.854	0.177
Lycaena_phlaeas	0.829	0.941	0.195
Papilio_demoleus	0.457	0.842	0.171
Papilio_machaon	0.687	0.927	0.194
Pieris_brassicae	0.814	0.880	0.212
Pieris_canidia	0.694	0.937	0.163
Pontia_daplidice	0.847	0.944	0.140
Vanessa_cardui	0.875	0.820	0.239

As a last part we load the saved predictions evaluated by the Morrow threshold function and prepare the data to plot a species richness map.

sdmStack <- stack(sdmMTP)# make one stack raster with the raster stack function
# stackApply function to stack and count the predictions of the unique species in a binary format, 1 for predicted occurence and 0 for absence 
sdmStack <- stackApply(sdmStack, indices = c(1), fun = sum, na.rm = TRUE)

# save the raster for later use
raster::writeRaster(sdmStack, file.path(wd, "predictions", "sdmStack_1_50.tif"), overwrite=T)

We read again the raster stack and plot with mapview.

gadmPak <- terra::vect(file.path(wd, "data", "gadmPak.gpkg"))# first we load gadm of Pakistan
gadmPak <- sf::st_as_sf(gadmPak)

There You can see that most of the species are predicted in the north of Pakistan, so that’s what we expected from the model. All in all it looks logical. If You want to improve the results You can go for hyperparameter tuning. For cforest it is recommended in the caret package to play with the mtry values. But this takes a lot more computation time so we skipped this part.

sdmStack <- raster::raster(file.path(wd, "predictions","sdmStack_1_50.tif"))
sdmStackMask <- raster::mask(sdmStack, gadmPak)# mask the predictions to the border of Pakistan
map1 <- mapview(sdmStackMask, zcol = "values", layer.name = "Species richness map1", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery", alpha.regions = 1)+
  mapview(gadmPak, alpha = 0.5, alpha.regions = 0.0, layer.name = "Border and districts of Pakistan", color = "blue")
map1

Finally, let’s plot a map to compare the predicted values with the map predicted by Morrow. Therefore we set our threshold to 0.5.

threshold = 0.5# setting a threshold for the binary sdm map

binarySpecies = list()
# function to stack all the binary predicted values into one list
m <- c(0, threshold, 0,
       threshold, 1, 1)
m <- matrix(m, ncol=3, byrow = TRUE)

for (i in filteredSpecies) {
  
  r <- terra::rast(predictions[i])
  binarySpecies[i] <- terra::classify(r, 
                                       m, 
                                       include.lowest = TRUE)
  rm(r)
}
# now we stack the SpatRasters stored in the list into one single SpatRaster
stack <- terra::rast(binarySpecies)
stack <- sum(stack)
# transform the SpatRaster into a Large RasterLayer by using raster package to plot it with mapview
# this task is not necessary, You can also do a simple plot of the stack by using the plot function
stackedRaster <- raster::raster(stack)
# save the raster for later use
raster::writeRaster(stackedRaster, file.path(wd, "predictions", "stackedRaster_1_50.tif"), overwrite=T)

# load the stacked RasterLayer
stackedRaster <- raster::raster(file.path(wd, "predictions","stackedRaster_1_50.tif"))
stackedRaster <- raster::mask(stackedRaster, gadmPak)# mask the predictions to the border of Pakistan
map2 <- mapview(stackedRaster, zcol = "values", layer.name = "Species richness map2", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery", alpha.regions = 1)+
  mapview(gadmPak, alpha = 0.5, alpha.regions = 0.0, layer.name = "Border and districts of Pakistan", color = "blue")
map2

7 Sources

7.1 Bibliography:

Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M. & F. Courchamp (2013): Will Climate change promote future invasions. Global Change Biology, Volume 19, Issue 12. Page: 3742.

BREIMAN, L. (2001): Random Forests. Machine Learning. 45: 5–32.

HOTHORN, T., HORNIK, K. & A. ZEILEIS (2006): Unbiased Recursive Partitioning: A Conditional Inference
Framework. Journal of Computational and Graphical Statistics 15/3: 651–674.

HOTHORN, T., LAUSEN, B., BENNER, A. & M. RADESPIEL-TRÖGER (2004): Bagging survival trees. Statistics in Medicine 23: 77–91.

Janitza S., Strobl C. & A.-L. Boulesteix (2013) An AUC-based permutation variable importance measure for random forests. Bioinformatics, 14:119.

NEMBRINI, S., KÖNIG, I. R. & M. N. WRIGHT (2018): The revival of the Gini importance? Bioinformatics 34: 3711–3718.

STROBEL, C., MALLEY, J. & G. TUTZ (2009a): An Introduction to Recursive Partition. Munich.

STROBEL, C., HOTHORN, T. & A. ZEILEIS (2009b): Party on! A New, Conditional Variable Importance Measure for Random Forests Available in the party Package. Munich.

STROBEL, C., BOULESTEIX, A.-C., ZEILEIS, A. & T. HOTHORN (2007): Bias in random forest variable importance measures: Illustrations, sources and a solution. Bioinformatics 8/25: o. A. WORLDCLIM (2022): Bioclimatic variables. https://worldclim.org/data/bioclim.html (Zugriff: 03.06.2023).

7.2 Images

Figure 1: Aglais Cashmirensis (Zugriff: 07.06.2023)

Figure 2: Own representation

Figure 3: Own representation

8 Session info

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] leaflet.extras2_1.2.1 gt_0.9.0              maptiles_0.5.0       
##  [4] leaflet_2.1.2.9000    ecospat_3.5.1         Metrics_0.1.4        
##  [7] party_1.3-13          strucchange_1.5-3     sandwich_3.1-0       
## [10] zoo_1.8-12            modeltools_0.2-23     mvtnorm_1.2-2        
## [13] mapview_2.11.0        dplyr_1.1.2           dismo_1.3-14         
## [16] raster_3.6-20         sp_2.0-0              geodata_0.5-8        
## [19] terra_1.7-37          sf_1.0-13             caret_6.0-94         
## [22] lattice_0.21-8        ggplot2_3.4.2         knitr_1.43           
## 
## loaded via a namespace (and not attached):
##   [1] uuid_1.1-0              backports_1.4.1         Hmisc_5.1-0            
##   [4] systemfonts_1.0.4       plyr_1.8.8              splines_4.2.2          
##   [7] crosstalk_1.2.0         listenv_0.9.0           TH.data_1.1-2          
##  [10] leafpop_0.1.0           digest_0.6.31           foreach_1.5.2          
##  [13] htmltools_0.5.5         earth_5.3.2             leaflet.providers_1.9.0
##  [16] fansi_1.0.4             checkmate_2.2.0         magrittr_2.0.3         
##  [19] cluster_2.1.4           ks_1.14.0               recipes_1.0.6          
##  [22] globals_0.16.2          gower_1.0.1             matrixStats_1.0.0      
##  [25] svglite_2.1.1           hardhat_1.3.0           timechange_0.2.0       
##  [28] colorspace_2.1-0        xfun_0.39               leafem_0.2.0           
##  [31] jsonlite_1.8.5          libcoin_1.0-9           brew_1.0-8             
##  [34] survival_3.5-5          iterators_1.0.14        ape_5.7-1              
##  [37] glue_1.6.2              PresenceAbsence_1.1.11  gtable_0.3.3           
##  [40] ipred_0.9-14            webshot_0.5.4           future.apply_1.11.0    
##  [43] abind_1.4-5             scales_1.2.1            DBI_1.1.3              
##  [46] Rcpp_1.0.10             plotrix_3.8-2           htmlTable_2.4.1        
##  [49] units_0.8-2             foreign_0.8-84          proxy_0.4-27           
##  [52] mclust_6.0.0            Formula_1.2-5           lava_1.7.2.1           
##  [55] prodlim_2023.03.31      htmlwidgets_1.6.2       RColorBrewer_1.1-3     
##  [58] ellipsis_0.3.2          nabor_0.5.0             farver_2.1.1           
##  [61] pkgconfig_2.0.3         reshape_0.8.9           nnet_7.3-19            
##  [64] sass_0.4.6              utf8_1.2.3              labeling_0.4.2         
##  [67] tidyselect_1.2.0        rlang_1.1.1             reshape2_1.4.4         
##  [70] munsell_0.5.0           TeachingDemos_2.12      tools_4.2.2            
##  [73] cachem_1.0.8            xgboost_1.7.5.1         cli_3.6.1              
##  [76] generics_0.1.3          ade4_1.7-22             evaluate_0.21          
##  [79] stringr_1.5.0           fastmap_1.1.1           yaml_2.3.7             
##  [82] maxnet_0.1.4            ModelMetrics_1.2.2.2    purrr_1.0.1            
##  [85] randomForest_4.7-1.1    satellite_1.0.4         coin_1.4-2             
##  [88] future_1.32.0           nlme_3.1-162            xml2_1.3.4             
##  [91] pracma_2.4.2            biomod2_4.2-4           compiler_4.2.2         
##  [94] rstudioapi_0.14         png_0.1-8               e1071_1.7-13           
##  [97] tibble_3.2.1            bslib_0.5.0             stringi_1.7.12         
## [100] highr_0.10              plotmo_3.6.2            poibin_1.5             
## [103] Matrix_1.5-4.1          commonmark_1.9.0        markdown_1.7           
## [106] classInt_0.4-9          vegan_2.6-4             gbm_2.1.8.1            
## [109] permute_0.9-7           vctrs_0.6.3             pillar_1.9.0           
## [112] lifecycle_1.0.3         jquerylib_0.1.4         data.table_1.14.8      
## [115] R6_2.5.1                KernSmooth_2.23-21      gridExtra_2.3          
## [118] parallelly_1.36.0       codetools_0.2-19        MASS_7.3-60            
## [121] gtools_3.9.4            withr_2.5.0             multcomp_1.4-25        
## [124] mgcv_1.8-42             parallel_4.2.2          rpart_4.1.19           
## [127] timeDate_4022.108       class_7.3-22            rmarkdown_2.22         
## [130] mda_0.5-4               pROC_1.18.2             lubridate_1.9.2        
## [133] base64enc_0.1-3

Species distribution modeling with cforest

Jannes Breidenbach, Enrico Kintscher, Malte Vandamme

11.07.2023