The following tutorial is part of the seminar
“Species
Distribution Modeling”. The aim of this tutorial is to introduce You
to the cforest method by modeling a species
distribution of butterfly species in Pakistan and creating a species
richness map from it.
Cforest, like Random Forest, is a statistical method for performing regression analyses. It uses an algorithm to create probability trees based on several independent variables (HOTHORN et al. 2006, p. 1). In contrast to linear regressions, random forest and cforest have the advantage that they can process large amounts of data with many complex variables in a relatively short time (STROBEL 2009a, p. 4). This fits well for our project to predict potential butterfly occurrences with hundreds of butterfly sites and a lot of biovariates like annual rainfall or temperature data.
As you see below, a decision tree is very easy to visualize. It consists of different internal nodes that divide sampled data into two categories by an independent variable. Through this procedure, the impurity of the data will be reduced. At the end of each branch are the leaf nodes which contain the splitted data. If one leaf node contains just one category of the data this leaf node is pure and the impurity value is zero (STROBEL et al. 2009a, p. 9). If one leaf node has more than one category, the impurity rate is higher (ibid.). However, the category with the most values will win the decision when the data of the leaf node is aggregated (BREIMAN 2001, p. 6). Below you can see a little example of a decision tree with the important vocabulary’s internal node, branch, variable and leaf node.
Cforest is a further development of Random Forest and works only slightly different. To understand cforest, it is therefore useful to explain Random Forest first.
Random Forest is characterized by the random selection of samples and variables (STROBEL et al. 2009a, p. 16). From a sample (for example our butterfly data) subsets are drawn at random, from which decision trees are then formed with randomly selected variables. At the end of the decision trees, i.e. in the leaf nodes, the individual observations can be found in classes. Thereby, the purity of the data summarized in classes was significantly increased by splitting in the nodes before. At the end, the observations of the classes are aggregated to one result. The result, which occurs in the most classes, wins (BREIMAN 2001, S. 1). The repeated dragging and dropping of the subsets from the data is called bootstrapping. Together with aggregating the data at the end of the decision trees, this process is also called bagging (HOTHORN et al. 2004, p. 78). Such a procedure results in good stability of the model and is better than just one decision tree (ibid.). This is because within individual decision trees there will certainly be individual errors. But if thousand such random decision trees are formed several times and the results are aggregated together, then the prediction will be very good on average. If you want do dive deeper in the Theory of Random Forests, we can recommend you the student tutorial from Mandy Gimpel which you can find here: Using Random Forest in R for Butterfly Fauna Distribution in Pakistan (geomoer.github.io).
After explaining Random Forest to you, we can move on to the further development of this model: to cforest. One problem with Random Forest’s decision trees is that they estimate certain variables to be more important than they actually are (STROBEL et al. 2009b, p. 1). These are especially variables with many categories or continuous data but also variables which have many missing measurement data (STROBEL et al 2009a, p. 30). This creates a bias that affects the importance of the variables and impacts the validity of the model (ibid.). Such a bias arises especially when many heterogeneous variables with different numbers of categories are used to create the decision trees (HOTHORN et al. 2006, p. 2). The cause of the incorrect estimation are calculations of the Gini importance or the permutation importance measurement of a variable. These two tests show how well a variable can clean up the data in a node by splitting the data (STROBEL et al. 2009a, p. 8). To put it in a nutshell, when the variable importance measure is biased towards variables with a lot of categories, these variables will be preferred to build the decision tree so the output of the model will be biased in an incorrect direction (STROBEL et al. 2007, p. 2). Since we use data with a high range of continuous data, i.e. the annual precipitation in the variable bio 12 (WORLDCLIM 2022) and data with fewer categories, i.e. the min temperature of the coldest month in the variable bio 06 in predicting butterfly occurrence, it makes sense to avoid such bias in the importance of variables with cforest. One more note: there are even more studies that are trying to solve the problem of biased variable selection using a corrected impurity measure (AIR) (NEMBRINI et al. 2018, p. 3711f.).
But how does cforest manage to circumvent this bias within the independent variables? This is done by measuring the significance of each variable selected for a node in the decision tree. To be absolutely clear, cforest creates random forests but the Gini importance and the permutation importance measurement for the selection of variables in the trees are taken out of the race (STROBEL et al. 2007, p. 3). Instead of this, the variable that has the highest significance, i.e. explains the largest part of the independent variable, is used as the first splitting variable within a node in the decision tree (HOTHORN et al., 2006, p. 2). All other variables that are also significant and thus explain a certain proportion of our dependent variable can also be used in the decision tree. However, if the significance test of a variable turns out to be negative, i.e. a p-test for example above 0.05%, then this tree is terminated (ibid., p. 4). This is to circumvent the bias of variable weighting, to achieve a better predictive accuracy of the model and to be able to analyse data with heterogeneous variables. We will call the trees that are created with the cforest function in R conditional inference trees (STROBEL et al. 2007, p. 5). Below is an example of one of our decision trees created with c-forest. The p-value is below 0.05 in all branches, which is a good result and explains why the variables are used to split the data into smaller categories.
Unfortunately, there is also a small problem with the selection of
variables in cforest. If bootstrapping, i.e. the
repeated dragging and replacement from a sample, is performed with
heterogeneous variables, then variables with many categories are
also preferred in cforest (STROBEL et al. 2007, p. 17). This
problem can be solved if the drawing of the variables is done
without replacement. Then, no variables with many
categories are used more often than other variables for creating the
decision trees (ibid.).
Another problem with using decision trees is
overfitting (HOTHORN et al. 2006, p. 2). Overfitting
occurs when the model learns the structures inherent in the
learning sample (STROBEL et al 2009a, p. 9). However, these
structures are not found in reality, but are only created by the
arbitrary selection of the sample. To prevent this
adaptation of the model to the learning data, the length of the
decision trees is limited. They can simply be pruned from a
certain number of nodes or they can be interrupted by a
statistical criterion. In cforest, for
example, the continuation of the decision tree can be
interrupted as soon as a selected significance level is not
reached (HOTHORN et al. 2006, p. 7). This also addresses the
question of how long decision trees should grow. STROBEL et al. also
points out that it can be useful to limit the length of the
decision trees to avoid overfitting (STROBEL et al. 2009a,
p. 10). Especially with a large number of data, the predictive accuracy
of the model seems to increase when using many but small trees. It is
different for small samples. In this case, the
use of large trees makes sense (ibid., p. 33). Since we
often have a small sample of butterfly species, we should apply this
advice and use large decision trees.
For our species distribution model, we will use the free available
Bioclimatic variables from WorldClim
dataset for Pakistan. This is part of the geodata
package. They contain 19 different variables which are
generated from monthly temperature and rainfall data
(WorldClim 2022). When you have a closer look, you can see that they
measure annual trends, for example the mean annual
temperature, seasonality, represented in annual data
and extreme environmental factors (ibid.). An example
for an extreme factor is the precipitation of the driest quarter and
could be a limiting environmental factor for the
occurrence of our butterflies in Pakistan. Below you can see all 19
variables which are part of the WorldClim biodata variables:
BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
BIO3 = Isothermality (BIO2/BIO7) (×100)
BIO4 = Temperature Seasonality (standard deviation ×100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter
First we clean the environment and use the garbage collector to free
some memory
rm(list=ls())# removes everything in the environment
gc()# garbage collector
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 514278 27.5 1147244 61.3 644242 34.5
## Vcells 916801 7.0 8388608 64.0 1635552 12.5
Used packages
library(caret)
library(sf)
library(geodata)
library(dismo)
library(dplyr)
library(mapview)
library(party)
library(terra)
library(Metrics)
library(ecospat)
library(raster)
library(leaflet)
library(maptiles)
library(ggplot2)
library(gt)
library(leaflet.extras2)
Working directory
= "C:\\UNI\\SoSe_23\\SDM\\workflowSDM"# the path varies on your structure
wd setwd(wd)
We start by loading the species file “PakistanLadakh.csv” provided to us
in the course. It is saved in CSV format and contains individual
butterfly species with presence points and xy coordinates collected in
Pakistan.
<- read.csv(file.path(wd, "data", "PakistanLadakh.csv"))# we are using the standard read.csv function to read in the data data
And change the column names
colnames(data) <- c("species","x","y")
We can now count the number of occurrences per species to see a bit of the data structure.
<- unique(data$species)
spec_names cat("The Dataset contains", format(length(spec_names)), "different species
with overall", format(length(data$species)), "presence entries,
and an average of", format(round(length(data$species)/length(spec_names),0)), "presence entries per species.")
## The Dataset contains 429 different species
## with overall 5870 presence entries,
## and an average of 14 presence entries per species.
After that we load the environmental variables for Pakistan. For this task we just used the bioclimate ‘bio’ data provided by WorldClim climate data in the geodata package. You can also use a lot more variables as predictors as you can see on the github page “rspatial / geodata”.
# get the environmental layers
<- file.path(wd, "data")
subfolderPath <- geodata::worldclim_country(country="PAK", res=10, var="bio", path=subfolderPath, version = "2.1")
raster names(raster) <- substr(names(raster), 11, 20)
# get the gadm (Database of Global Administrative Areas) border of Pakistan also from the geodata package
<- geodata::gadm(country="Pakistan", level=0, path=subfolderPath)
gadmPak
# and mask the raster layer to the border of Pakistan
<- terra::mask(raster, gadmPak)
raster
# save the raster for later use
::writeRaster(raster, file.path(wd, "data", "bioclimPak.tif"), overwrite=T) terra
For testing the model we selecting just one single species from the provided data. In this case Aglais caschmirensis.
<- dplyr::filter(data, species=="Aglais_caschmirensis")# using the filter function from dplyr package
Aglais_caschmirensis
# We are transforming the data.frame into an sf object for later using some functions in the terra package. Therefore we use the sf package
# You can ckeck the crs (coordinate reference system) by using the crs function in the terra package (example: terra::crs(raster))
<- sf::st_as_sf(Aglais_caschmirensis, coords=c("x", "y"), remove=F, crs=sf::st_crs("epsg:4326")) Aglais_caschmirensis
Next we create some background absence points, for this example 1000.
= 1000 # we use randomly generated 1000 sample points, You can also use more or less points
n # for generating we use the dismo package and the randomPoints function
<- sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(raster),
bg n=n)),
crs=terra::crs(raster),
coords=c("x","y"), remove=F)
# after generating we extract the environmental information for background data from our SpatRaster raster
<- terra::extract(raster, bg)
bg_extr <- cbind(bg,bg_extr);rm(bg_extr)
bg
# and binding the extracted background absence points with the extracted presence points from the butterfly species to one data.frame
<- terra::extract(raster, Aglais_caschmirensis)
Aglais_caschmirensis_extr <- cbind(Aglais_caschmirensis, Aglais_caschmirensis_extr);rm(Aglais_caschmirensis_extr) Aglais_caschmirensis
We can now plot the data to look were the butterflies are in presence. Therefore we use the leaflet package.
leaflet(data = Aglais_caschmirensis) %>% addTiles() %>%
setView(lat = 33.7, lng = 73.1, zoom = 4) %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri World Imagery") %>%
addAwesomeMarkers(~x, ~y, popup = ~ID, label = ~ID, group = "presence points of Aglais caschmirensis") %>%
addPolygons(data = gadmPak, col="black", fillColor="Transparent",
opacity = 1, group = "Border of Pakistan") %>%
addLayersControl(
baseGroups = "Esri World Imagery",
overlayGroups = c("presence points of Aglais caschmirensis", "Border of Pakistan"),
options = layersControlOptions(collapsed = F))
And now we are splitting and saving the data of the selected butterfly species into a training and testing set by using the dplyr package. We decided for 80 percent training and 20 percent testing. If You have more available presence data points You can also go for 75/25 or 70/30 splitting.
`%not_in%`<- Negate(`%in%`)# formula to split the data
set.seed(2023) #set a seed for reproducibility
=dplyr::slice_sample(Aglais_caschmirensis, prop=.20)
testData=dplyr::filter(Aglais_caschmirensis, ID %not_in% testData$ID )
trainData
# We save the test, train and background data by using the sf package
::write_sf(testData, file.path(wd, "data", "Aglais_caschmirensis_testData.gpkg"))
sf::write_sf(trainData, file.path(wd, "data","Aglais_caschmirensis_trainData.gpkg"))
sf::write_sf(bg, file.path(wd, "data","background.gpkg")) sf
Now we load the saved training and background data created in the step before to make one data.frame with presence and absence points for the modeling.
# read the presence only data
<- sf::read_sf(file.path(wd, "data", "Aglais_caschmirensis_trainData.gpkg"))
po $occ=1
po# read the absence only data
<- sf::read_sf(file.path(wd, "data","background.gpkg"))
bg $occ=0
bg# we can delete the species column because we have just one unique species
<- po %>% select(-species)
po #str(po) you can have a look at the structure
We bind the presence and absence data to one data.frame and can delete the geom column.
<- rbind(po,bg)%>%as.data.frame()%>%dplyr::select(-geom) data
Now we can plot a sample table with the first 10 rows of our data to look if everything is okay up to here.
# for this plot we use the gt package
head(data, 10) %>%
gt() %>%
tab_header(title = md("*Aglais caschmirensis* data summary")) %>%
fmt_number(
columns = c(bio_1:bio_19),
decimals = 3# for a better overview we set the decimal values to 3
%>%
) tab_style(
style = list(cell_fill(color = "lightblue"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(x, y))) %>%
tab_style(
style = list(cell_fill(color = "lightyellow"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(bio_1:bio_19))
)
Aglais caschmirensis data summary | ||||||||||||||||||||||
x | y | ID | bio_1 | bio_2 | bio_3 | bio_4 | bio_5 | bio_6 | bio_7 | bio_8 | bio_9 | bio_10 | bio_11 | bio_12 | bio_13 | bio_14 | bio_15 | bio_16 | bio_17 | bio_18 | bio_19 | occ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
71.62838 | 35.92580 | 2 | −3.513 | 9.158 | 27.837 | 851.534 | 13.300 | −19.600 | 32.900 | −10.417 | −1.783 | 6.917 | −13.650 | 707.000 | 122.000 | 18.000 | 57.152 | 319.000 | 82.000 | 107.000 | 205.000 | 1 |
71.69917 | 35.81516 | 3 | 8.062 | 10.942 | 32.759 | 843.411 | 25.600 | −7.800 | 33.400 | 6.500 | 9.583 | 18.433 | −2.267 | 426.000 | 80.000 | 14.000 | 64.544 | 213.000 | 49.000 | 63.000 | 101.000 | 1 |
71.70334 | 35.87940 | 4 | 7.496 | 10.942 | 31.351 | 901.777 | 26.000 | −8.900 | 34.900 | 5.983 | 9.067 | 18.550 | −3.617 | 386.000 | 74.000 | 13.000 | 66.715 | 197.000 | 45.000 | 53.000 | 91.000 | 1 |
71.70810 | 35.98351 | 5 | 4.304 | 10.292 | 29.405 | 911.104 | 22.900 | −12.100 | 35.000 | 2.717 | 5.883 | 15.467 | −6.850 | 395.000 | 75.000 | 11.000 | 64.160 | 196.000 | 45.000 | 57.000 | 97.000 | 1 |
71.77320 | 36.15080 | 6 | −5.754 | 9.442 | 28.786 | 837.166 | 10.700 | −22.100 | 32.800 | −12.633 | −3.950 | 4.433 | −15.667 | 715.000 | 116.000 | 20.000 | 52.535 | 308.000 | 91.000 | 110.000 | 210.000 | 1 |
71.78736 | 35.36315 | 7 | 8.238 | 10.775 | 31.973 | 872.213 | 25.700 | −8.000 | 33.700 | 6.883 | 9.850 | 18.850 | −2.633 | 610.000 | 107.000 | 18.000 | 57.613 | 280.000 | 70.000 | 116.000 | 144.000 | 1 |
71.91097 | 36.11939 | 9 | −1.250 | 10.333 | 30.039 | 867.614 | 16.400 | −18.000 | 34.400 | −3.417 | 0.717 | 9.317 | −11.617 | 549.000 | 95.000 | 16.000 | 55.037 | 246.000 | 68.000 | 87.000 | 148.000 | 1 |
72.45343 | 34.91264 | 10 | 17.717 | 13.400 | 39.067 | 802.562 | 34.900 | 0.600 | 34.300 | 26.317 | 13.600 | 27.450 | 7.533 | 895.000 | 139.000 | 24.000 | 51.952 | 321.000 | 101.000 | 306.000 | 203.000 | 1 |
72.49574 | 36.10739 | 11 | 2.808 | 11.433 | 31.937 | 901.410 | 21.100 | −14.700 | 35.800 | 1.200 | 4.617 | 13.717 | −8.300 | 339.000 | 55.000 | 10.000 | 49.481 | 149.000 | 42.000 | 73.000 | 75.000 | 1 |
72.49646 | 36.04957 | 12 | 4.517 | 11.567 | 31.177 | 954.195 | 23.800 | −13.300 | 37.100 | 2.850 | 6.367 | 16.083 | −7.233 | 317.000 | 51.000 | 9.000 | 49.974 | 140.000 | 39.000 | 71.000 | 67.000 | 1 |
In the next task we build the model with the caret package and cforest
as the method. We selected for the controls cforest_unbiased with an
mtry of 3 and ntree of 50, the default is 5 and 500, but this needs a
lot of computation time and computer resources. After train the model we
can use the varImp function to have a look at the importance of the
variables.
<- caret::train(occ ~ bio_1 + bio_2 + bio_3 + bio_4 + bio_5 + bio_6 + bio_7 + bio_8 + bio_9 + bio_10 +
model1 + bio_12 + bio_13 + bio_14 + bio_15 + bio_16 + bio_17 + bio_18 + bio_19,
bio_11 data = data,
method = "cforest",
controls = cforest_unbiased(mtry = 3, ntree = 50),
na.action = na.exclude)
::saveRDS(model1, file.path(wd, "models", file = "model1.rds")) terra
In the variable importance plot You can see that bio_17 (Precipitation of Driest Quarter) and bio_1 (Annual Mean Temperature) has the highest importance on the model1 and bio_15 (Precipitation Seasonality) plays almost no role for the training.
# for plotting the variable importance we use the ggplot2 package
::ggplot(varImp(model1))+
ggplot2ggtitle("Variable importance of model1")
Now it’s time for predicting and saving the result as a SpatRaster. For the prediction and saving we use the terra package.
<- terra::predict(object = raster, # our raster for the prediction
predModel1 model = model1, # the model we trained in the step before
OBB=TRUE, # default for cforest
na.rm = T, # cforest can not handle NAs
cores=4, # for faster computation time You can set here the number of the cores Your computer has installed
cpkgs="party") # to use the cores You need link the package Your used method belongs to
# for more information have a look at the terra package documentation
::writeRaster(predModel1, file.path(wd, "predictions", "prediction_Aglais_caschmirensis.tif"), overwrite=TRUE)# save the prediction raster terra
We plot the prediction raster with the mapview package. It shows the habitat suitability of the Aglais caschmirensis butterfly species in comparison to the occurance points.
<- raster::raster(file.path(wd, "predictions","prediction_Aglais_caschmirensis.tif"))
rastPred mapview(rastPred, zcol = "values", layer.name = "Habitat suitability of Aglais caschmirensis", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery") +
mapview(po, layer.name = "Presence points")
For evaluation we use the Boyce Index from the ecospat Package and with the Metrics package we calculate the AUC. The area under the curve (AUC) is an accuracy measure which explains the precision of the model. A value of 1 would be a perfect fitting of the model. Instead of this, the value of 0.5 indicate that the model is not better than random (Janitza et al. 2013, p. 3ff.). The Boyce index indicates the extent to which model predictions vary from random predictions. Unlike the AUC, the Boyce index only requires presence points. The value of the Boyce index ranges from 0 to +1, where a positive value indicates high prediction accuracy, and a value close to zero suggests that the prediction does not differ from a random guess (Bellard et al. 2013, p. 3742).
# load your testdata
<- terra::rast(file.path(wd, "predictions", "prediction_Aglais_caschmirensis.tif"))
predRast <- sf::read_sf(file.path(wd, "data", "Aglais_caschmirensis_testData.gpkg"))
testData
# calculate and plot the boyce-index:
<- ecospat::ecospat.boyce(fit=raster(predRast), obs=testData) boyceIndex
<- boyceIndex$cor
boyceIndex
#extract the values of the testdata from the prediction raster for AUC and MAE
<- terra::extract(predRast, testData)
extrTest colnames(extrTest) <- c( "ID" , "predcicted")
$observed <- 1
extrTest
# load background data and extract
<- sf::read_sf(file.path(wd, "data", "background.gpkg"))
bg <- terra::extract(predRast, bg)
extrBg colnames(extrBg) <- c( "ID" , "predcicted")
$observed <- 0
extrBg
# bind both dataframes together:
<- rbind(extrTest, extrBg)
testData_boyce #rm(extrTest, extrBg,bg)
# calculate AUC and MAE
<- Metrics::auc(actual = testData_boyce$observed, predicted = testData_boyce$predcicted)
AUC
print(AUC)
## [1] 0.9797
print(boyceIndex)
## [1] 0.687
After testing the model on one single species we can go for a larger prediction using more different species and a for loop for modeling.
We load all the needed data again and use a filter for the amount of different species, for demonstrating we use a threshold of 50 presence point per species. If You have a powerful computer You can use lower thresholds but in that case the computation time is longer.
rm(list=ls()) # first we clean again the workspace
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5691539 304 10985567 586.7 7968538 425.6
## Vcells 9429096 72 279432890 2132.0 436477280 3330.1
Setting the working directory
= "C:\\UNI\\SoSe_23\\SDM\\workflowSDM"
wd setwd(wd)
Load the data
<- read.csv(file.path(wd, "data", "PakistanLadakh.csv"))
data colnames(data) <- c("species","x","y")
Filter the data
<- table(data$species)
speciesButterflies
<- 50 # Threshold can vary, depends on the power of the used pc hardware
x <- names(speciesButterflies[speciesButterflies >= x | speciesButterflies == 0])
filteredSpecies
<- data[data$species %in% filteredSpecies, ] data
# We are transforming the data.frame again into an sf object
<- sf::st_as_sf(data, coords=c("x", "y"), remove=F, crs=sf::st_crs("epsg:4326")) data
Loading the environmental data raster from the first modeling
<- terra::rast(file.path(wd, "data", "bioclimPak.tif")) raster
Prepairing everything for the loop function
To find a threshold for the prediction we use a function developed by Cecina Babich Morrow, for more information you can click on the following link: Thresholding species distribution models
# function of Cecina Babich Morrow
<- function(sdm, occs, type = "mtp", binary = FALSE){
sdm_threshold <- raster::extract(sdm, occs)
occPredVals if(type == "mtp"){
<- min(na.omit(occPredVals))
thresh else if(type == "p10"){
} if(length(occPredVals) < 10){
<- floor(length(occPredVals) * 0.9)
p10 else {
} <- ceiling(length(occPredVals) * 0.9)
p10
}<- rev(sort(occPredVals))[p10]
thresh
}<- sdm
sdm_thresh < thresh] <- NA
sdm_thresh[sdm_thresh if(binary){
>= thresh] <- 1
sdm_thresh[sdm_thresh
}return(sdm_thresh)
}
After that we set some parameters for the loop and create some lists to store the results of each model and prediction.
<- 100# we set the sample for absence points to 100 for faster computation time
numSamples
`%not_in%`<- Negate(`%in%`)# part of the data splitting function
# lists to store the results while the loop is running
<- list()# model of every unique species
models <- list()# predictions of every unique species
predictions <- list()# boyce Indices
boyceIndices <- list()# auc values
aucValues <- list()# mean absolute error for evaluation
maeValues <- list()# the predicted rasters with the threshold of Cecina Babich Morrow sdmMTP
Now it’s time for the loop.
for (i in filteredSpecies) {
<- subset(data, species %in% i)# subset for every unique species in the data
dataSubset
<- terra::extract(raster, dataSubset)
dataSpeciesExtr <- cbind(dataSubset, dataSpeciesExtr)
dataSpecies $occ <- 1# setting the occurance data to 1
dataSpecies
# splitting the data in training and testing
set.seed(2023)
<- dplyr::slice_sample(dataSpecies, prop=.25)# 25% for testing data, rest for training
testData <- na.omit(testData)
testData <- dplyr::filter(dataSpecies, ID %not_in% testData$ID)
trainDataMTP <- trainDataMTP %>% select(-species)
trainData
# making absence points
<- sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(raster),
absenceSamples n=numSamples)), # number of samples
crs=terra::crs(raster),
coords=c("x","y"), remove=F)
<- terra::extract(raster, absenceSamples)
absenceValuesExtr <- cbind(absenceSamples, absenceValuesExtr)
absenceValues $occ <- 0# setting the absence data to 0
absenceValues
<- rbind(trainData, absenceValues) %>% as.data.frame ()%>% dplyr::select(-geometry)
trainData
<- caret::train(occ ~ bio_1 + bio_2 + bio_3 + bio_4 + bio_5 + bio_6 + bio_7 + bio_8 + bio_9 + bio_10
model + bio_11 + bio_12 + bio_13 + bio_14 + bio_15 + bio_16 + bio_17 + bio_18 + bio_19,
data = trainData,
method = "cforest",
controls = cforest_unbiased(mtry = 1, ntree = 50),
na.action = na.exclude)
<- model
models[[i]] <- paste0(i, "_model_1_50.rds")
modelFile ::saveRDS(model, file.path(wd, "models", file = modelFile))
terra
<- terra::predict(object = raster,
predict model = model,
OBB = TRUE,
na.rm = TRUE,
cores = 4,
cpkgs = "party",
se.fit=TRUE)
<- predict
predictions[[i]]
<- paste0(i, "_predict_1_50.tif")
predictionFile ::writeRaster(predict, file.path(wd, "predictions", file = predictionFile), overwrite=TRUE)
terra
# calculating the boyce index
<- ecospat::ecospat.boyce(fit = raster(predict), obs = testData)
boyceIndex <- boyceIndex$cor
boyceIndices[[i]]
# calculate the AUC and MAE
<- terra::extract(predict, testData)
extrTest colnames(extrTest) <- c("ID", "predicted")
$observed <- 1
extrTest
<- terra::extract(predict, absenceValues)
extrAv colnames(extrAv) <- c("ID", "predicted")
$observed <- 0
extrAv<- na.omit(extrAv)
extrAv
<- rbind(extrTest, extrAv)
testData
<- Metrics::mae(actual = testData$observed, predicted = testData$predicted)
maeValue <- maeValue
maeValues[[i]] <- Metrics::auc(actual = testData$observed, predicted = testData$predicted)
aucValue <- aucValue
aucValues[[i]]
# predictions with the Morrow threshold
<- sdm_threshold(raster(predict), trainDataMTP[,2:3], "mtp", binary = TRUE)
sdm <- sdm
sdmMTP[i] <- paste0(i, "_predictThres_1_50.tif")
predictThresFile ::writeRaster(sdm, file.path(wd, "predictions", file = predictThresFile), overwrite=TRUE)
raster }
For Evaluation we use the Boyce Index and the AUC again and additional the MAE (Mean Absolute Error). Therefore we make a summary data.frame and store every value from each prediction in it. The summary shows a high variance in the Boyce Index if You look at the the table. This might be the effect of a poor number of presence points collected in the field work. So the models are strictly limited to the amount of presence points.
<- data.frame(species = filteredSpecies,
summary BoyceIndex = unlist(boyceIndices),
AUC = unlist(aucValues),
MAE = unlist(maeValues))
saveRDS(summary, file = "predictions/summary_1_50.Rds") # Save data.frame
First load the saved summary data.
<- readRDS(file = "C:\\UNI\\SoSe_23\\SDM\\workflowSDM\\predictions\\summary_1_50.Rds") summary
Now we can plot the evaluation summary by using the gt package.
# for this plot we use the gt package
%>%
summary gt() %>%
tab_header(title = md("Summary of the evaluation indices")) %>%
fmt_number(
columns = c(BoyceIndex, AUC, MAE),
decimals = 3# for a better overview we set the decimal values to 3
%>%
) tab_style(
style = list(cell_fill(color = "lightblue"),
cell_text(weight = "bold")),
locations = cells_body(columns = c(BoyceIndex, AUC, MAE))) %>%
tab_style(
style = list(cell_fill(color = "lightyellow"),
cell_text(weight = "bold", style = "italic")),
locations = cells_body(columns = species)
)
Summary of the evaluation indices | |||
species | BoyceIndex | AUC | MAE |
---|---|---|---|
Aglais_caschmirensis | 0.467 | 0.944 | 0.150 |
Anaphaeis_aurota | 0.872 | 0.898 | 0.231 |
Colias_erate | 0.543 | 0.896 | 0.236 |
Colias_fieldii | 0.835 | 0.908 | 0.225 |
Danaus_chrysippus | 0.919 | 0.832 | 0.253 |
Eurema_hecabe | 0.749 | 0.845 | 0.197 |
Gonepteryx_nepalensis | 0.827 | 0.924 | 0.192 |
Junonia_orithya | 0.775 | 0.854 | 0.177 |
Lycaena_phlaeas | 0.829 | 0.941 | 0.195 |
Papilio_demoleus | 0.457 | 0.842 | 0.171 |
Papilio_machaon | 0.687 | 0.927 | 0.194 |
Pieris_brassicae | 0.814 | 0.880 | 0.212 |
Pieris_canidia | 0.694 | 0.937 | 0.163 |
Pontia_daplidice | 0.847 | 0.944 | 0.140 |
Vanessa_cardui | 0.875 | 0.820 | 0.239 |
As a last part we load the saved predictions evaluated by the Morrow threshold function and prepare the data to plot a species richness map.
<- stack(sdmMTP)# make one stack raster with the raster stack function
sdmStack # stackApply function to stack and count the predictions of the unique species in a binary format, 1 for predicted occurence and 0 for absence
<- stackApply(sdmStack, indices = c(1), fun = sum, na.rm = TRUE)
sdmStack
# save the raster for later use
::writeRaster(sdmStack, file.path(wd, "predictions", "sdmStack_1_50.tif"), overwrite=T) raster
We read again the raster stack and plot with mapview.
<- terra::vect(file.path(wd, "data", "gadmPak.gpkg"))# first we load gadm of Pakistan
gadmPak <- sf::st_as_sf(gadmPak) gadmPak
There You can see that most of the species are predicted in the north of Pakistan, so that’s what we expected from the model. All in all it looks logical. If You want to improve the results You can go for hyperparameter tuning. For cforest it is recommended in the caret package to play with the mtry values. But this takes a lot more computation time so we skipped this part.
<- raster::raster(file.path(wd, "predictions","sdmStack_1_50.tif"))
sdmStack <- raster::mask(sdmStack, gadmPak)# mask the predictions to the border of Pakistan
sdmStackMask <- mapview(sdmStackMask, zcol = "values", layer.name = "Species richness map1", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery", alpha.regions = 1)+
map1 mapview(gadmPak, alpha = 0.5, alpha.regions = 0.0, layer.name = "Border and districts of Pakistan", color = "blue")
map1
Finally, let’s plot a map to compare the predicted values with the map predicted by Morrow. Therefore we set our threshold to 0.5.
= 0.5# setting a threshold for the binary sdm map
threshold
= list()
binarySpecies # function to stack all the binary predicted values into one list
<- c(0, threshold, 0,
m 1, 1)
threshold, <- matrix(m, ncol=3, byrow = TRUE)
m
for (i in filteredSpecies) {
<- terra::rast(predictions[i])
r <- terra::classify(r,
binarySpecies[i]
m, include.lowest = TRUE)
rm(r)
}# now we stack the SpatRasters stored in the list into one single SpatRaster
<- terra::rast(binarySpecies)
stack <- sum(stack)
stack # transform the SpatRaster into a Large RasterLayer by using raster package to plot it with mapview
# this task is not necessary, You can also do a simple plot of the stack by using the plot function
<- raster::raster(stack)
stackedRaster # save the raster for later use
::writeRaster(stackedRaster, file.path(wd, "predictions", "stackedRaster_1_50.tif"), overwrite=T) raster
# load the stacked RasterLayer
<- raster::raster(file.path(wd, "predictions","stackedRaster_1_50.tif"))
stackedRaster <- raster::mask(stackedRaster, gadmPak)# mask the predictions to the border of Pakistan
stackedRaster <- mapview(stackedRaster, zcol = "values", layer.name = "Species richness map2", col.regions = RColorBrewer::brewer.pal(11, "YlOrRd"), map.types = "Esri.WorldImagery", alpha.regions = 1)+
map2 mapview(gadmPak, alpha = 0.5, alpha.regions = 0.0, layer.name = "Border and districts of Pakistan", color = "blue")
map2
Bellard, C., Thuiller, W., Leroy, B., Genovesi, P., Bakkenes, M. &
F. Courchamp (2013): Will Climate change promote future invasions.
Global Change Biology, Volume 19, Issue 12. Page: 3742.
BREIMAN, L. (2001): Random Forests. Machine Learning. 45: 5–32.
HOTHORN, T., HORNIK, K. & A. ZEILEIS (2006): Unbiased Recursive
Partitioning: A Conditional Inference
Framework. Journal of Computational and Graphical Statistics 15/3:
651–674.
HOTHORN, T., LAUSEN, B., BENNER, A. & M. RADESPIEL-TRÖGER (2004):
Bagging survival trees. Statistics in Medicine 23: 77–91.
Janitza S., Strobl C. & A.-L. Boulesteix (2013) An AUC-based
permutation variable importance measure for random forests.
Bioinformatics, 14:119.
NEMBRINI, S., KÖNIG, I. R. & M. N. WRIGHT (2018): The revival of the
Gini importance? Bioinformatics 34: 3711–3718.
STROBEL, C., MALLEY, J. & G. TUTZ (2009a): An Introduction to
Recursive Partition. Munich.
STROBEL, C., HOTHORN, T. & A. ZEILEIS (2009b): Party on! A New,
Conditional Variable Importance Measure for Random Forests Available in
the party Package. Munich.
STROBEL, C., BOULESTEIX, A.-C., ZEILEIS, A. & T. HOTHORN (2007):
Bias in random forest variable importance measures: Illustrations,
sources and a solution. Bioinformatics 8/25: o. A. WORLDCLIM (2022):
Bioclimatic variables. https://worldclim.org/data/bioclim.html (Zugriff:
03.06.2023).
Figure 1:
Aglais
Cashmirensis (Zugriff: 07.06.2023)
Figure 2: Own representation
Figure 3: Own representation
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] leaflet.extras2_1.2.1 gt_0.9.0 maptiles_0.5.0
## [4] leaflet_2.1.2.9000 ecospat_3.5.1 Metrics_0.1.4
## [7] party_1.3-13 strucchange_1.5-3 sandwich_3.1-0
## [10] zoo_1.8-12 modeltools_0.2-23 mvtnorm_1.2-2
## [13] mapview_2.11.0 dplyr_1.1.2 dismo_1.3-14
## [16] raster_3.6-20 sp_2.0-0 geodata_0.5-8
## [19] terra_1.7-37 sf_1.0-13 caret_6.0-94
## [22] lattice_0.21-8 ggplot2_3.4.2 knitr_1.43
##
## loaded via a namespace (and not attached):
## [1] uuid_1.1-0 backports_1.4.1 Hmisc_5.1-0
## [4] systemfonts_1.0.4 plyr_1.8.8 splines_4.2.2
## [7] crosstalk_1.2.0 listenv_0.9.0 TH.data_1.1-2
## [10] leafpop_0.1.0 digest_0.6.31 foreach_1.5.2
## [13] htmltools_0.5.5 earth_5.3.2 leaflet.providers_1.9.0
## [16] fansi_1.0.4 checkmate_2.2.0 magrittr_2.0.3
## [19] cluster_2.1.4 ks_1.14.0 recipes_1.0.6
## [22] globals_0.16.2 gower_1.0.1 matrixStats_1.0.0
## [25] svglite_2.1.1 hardhat_1.3.0 timechange_0.2.0
## [28] colorspace_2.1-0 xfun_0.39 leafem_0.2.0
## [31] jsonlite_1.8.5 libcoin_1.0-9 brew_1.0-8
## [34] survival_3.5-5 iterators_1.0.14 ape_5.7-1
## [37] glue_1.6.2 PresenceAbsence_1.1.11 gtable_0.3.3
## [40] ipred_0.9-14 webshot_0.5.4 future.apply_1.11.0
## [43] abind_1.4-5 scales_1.2.1 DBI_1.1.3
## [46] Rcpp_1.0.10 plotrix_3.8-2 htmlTable_2.4.1
## [49] units_0.8-2 foreign_0.8-84 proxy_0.4-27
## [52] mclust_6.0.0 Formula_1.2-5 lava_1.7.2.1
## [55] prodlim_2023.03.31 htmlwidgets_1.6.2 RColorBrewer_1.1-3
## [58] ellipsis_0.3.2 nabor_0.5.0 farver_2.1.1
## [61] pkgconfig_2.0.3 reshape_0.8.9 nnet_7.3-19
## [64] sass_0.4.6 utf8_1.2.3 labeling_0.4.2
## [67] tidyselect_1.2.0 rlang_1.1.1 reshape2_1.4.4
## [70] munsell_0.5.0 TeachingDemos_2.12 tools_4.2.2
## [73] cachem_1.0.8 xgboost_1.7.5.1 cli_3.6.1
## [76] generics_0.1.3 ade4_1.7-22 evaluate_0.21
## [79] stringr_1.5.0 fastmap_1.1.1 yaml_2.3.7
## [82] maxnet_0.1.4 ModelMetrics_1.2.2.2 purrr_1.0.1
## [85] randomForest_4.7-1.1 satellite_1.0.4 coin_1.4-2
## [88] future_1.32.0 nlme_3.1-162 xml2_1.3.4
## [91] pracma_2.4.2 biomod2_4.2-4 compiler_4.2.2
## [94] rstudioapi_0.14 png_0.1-8 e1071_1.7-13
## [97] tibble_3.2.1 bslib_0.5.0 stringi_1.7.12
## [100] highr_0.10 plotmo_3.6.2 poibin_1.5
## [103] Matrix_1.5-4.1 commonmark_1.9.0 markdown_1.7
## [106] classInt_0.4-9 vegan_2.6-4 gbm_2.1.8.1
## [109] permute_0.9-7 vctrs_0.6.3 pillar_1.9.0
## [112] lifecycle_1.0.3 jquerylib_0.1.4 data.table_1.14.8
## [115] R6_2.5.1 KernSmooth_2.23-21 gridExtra_2.3
## [118] parallelly_1.36.0 codetools_0.2-19 MASS_7.3-60
## [121] gtools_3.9.4 withr_2.5.0 multcomp_1.4-25
## [124] mgcv_1.8-42 parallel_4.2.2 rpart_4.1.19
## [127] timeDate_4022.108 class_7.3-22 rmarkdown_2.22
## [130] mda_0.5-4 pROC_1.18.2 lubridate_1.9.2
## [133] base64enc_0.1-3