1. What are Random Forests?

1.1 Understanding the Trees in the Forest



In order to understand Random Forests, we first have to understand Decision Trees. A Decision Tree first poses a statement and then makes a decision based on whether the statement is True or False. When things are classified into categories, it is called a Classification Tree, and when numeric values are predicted, it is called a Regression Tree. (Genuer & Poggi 2020, p. 9 f.)
Picture 1 Example for a Classification Tree






Picture 2 Example for a Regression Tree (beware: the numbers make no sense at all, they are just there to show the principle)



Trees can also combine numeric data and True/False data. The same decision variable can appear multiple times with different numeric thresholds, and the same classification result can appear in several leaves. The base of the tree is called the root; parts with both descendants and predecessors are called branches or internal nodes; and the final parts without descendants are called leaves.


Picture 3 Another example for a Classification Tree with the Parts of the Tree (own picture based on https://pixabay.com/de/vectors/baum-stamm-bl%c3%a4tter-ge%c3%a4st-natur-576847/)
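The parts of the tree just described can be sketched as a toy classification tree in base R (a purely illustrative example with made-up thresholds, not part of the later butterfly model): the first condition is the root, the nested conditions are internal nodes, and the returned class labels are the leaves.

```r
# a toy classification tree written as nested if/else statements;
# the thresholds and labels are invented purely for illustration
classify_habitat <- function(temp_avg, precipitation) {
  if (temp_avg > 10) {                # root node (numeric threshold)
    if (precipitation > 400) {        # internal node / branch
      "suitable"                      # leaf
    } else if (temp_avg > 25) {       # the same variable can reappear
      "unsuitable"                    # with a different threshold ...
    } else {
      "suitable"                      # ... and the same label can
    }                                 # appear in several leaves
  } else {
    "unsuitable"
  }
}

classify_habitat(15, 500)  # "suitable"
classify_habitat(5, 500)   # "unsuitable"
```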

For the creation of a tree, you first have to grow it, meaning the continuous splitting of the data into increasingly homogeneous groups. The next step is called stopping: the splitting stops either because further splitting is impossible (no further improvement can be made) or because the number of observations in a node would fall under a predefined minimum. In the last step you have to prune the tree. This is because the maximal tree has a very high variance but a very low bias (bias means the deviation from the expected values), while at the root the variance is very low but the bias very high. Pruning means that you keep only the most important splits in order to prevent overfitting (meaning that the model is too adapted to the training data), so you stay between the two extremes. However, Decision Trees are prone to instability: if you change the data just a little bit, the results can change very much. (Genuer & Poggi 2020, p. 9 ff., p. 13 f., p. 33 f.)







1.2 Growing the forest



As you might have guessed, the topic is called Random Forests. The method was developed to improve the accuracy of decision trees by creating multiple trees from random subsets of the data. Each subset has the same number of data points as the full dataset, drawn with replacement (so-called bootstrap samples), so individual observations can be reused within and across trees. The majority of the data points appear in each bootstrap sample at least once (according to Cutler et al. approximately 63% [2007, p. 2784]). The observations that aren't used in a tree's bootstrap sample are the so-called out-of-bag (OOB) observations resp. samples. The trees are fully grown and then predict their OOB observations. The predicted class of an object is the majority vote of the predictions of the individual decision trees for this object. The same idea yields the out-of-bag error, i.e. the share of misclassified OOB observations, which is an indicator for the accuracy of the model. (Cutler et al. 2007, p. 2784 f.; Yiu 2019; o. A. 2021)
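The roughly 63% figure can be checked with a quick base-R sketch (illustrative only): draw one bootstrap sample of the same size as the data and count how many distinct observations it contains.

```r
# drawing one bootstrap sample (same size as the data, with replacement)
# and measuring which share of the observations ends up "in the bag";
# the remaining observations are the out-of-bag (OOB) samples
set.seed(42)
n <- 100000
bootstrap_sample <- sample(n, size = n, replace = TRUE)
in_bag_share <- length(unique(bootstrap_sample)) / n

in_bag_share      # close to 1 - 1/e, i.e. about 0.632
1 - in_bag_share  # about 0.368 of the data is out-of-bag
```

The expected share is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 63.2% for large n.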

Picture 4 How Random Forests make a decision (Yiu 2019)



Random Forests need absence points in order to work. There are essentially two types of absence points: true absences, which lie outside the potential distribution of the species, and pseudo-absences, which are randomly sampled from the background without regard to the potential distribution of the species. The idea behind the random background samples is that they show the model the difference between the climate the species prefers and the climates available in the area. In reality, Random Forests don't always perform very well, which is often attributed to the huge number of true or pseudo-absences leading to imbalanced classes; Random Forests react very sensitively to this exact phenomenon. Also, false (pseudo-)absences (so-called class overlap) can have a negative impact on the model's quality, so we will later use different methods to generate the absence points. (Barbet-Massin et al. 2012, p. 328; Valavi et al. 2020, p. 1732 f.)

The individual trees in the random forest still have a low bias and high variance, but averaging over many trees also averages out the variance. As a result, Random Forests have a low bias and moderate variance. The advantages of the model are its high accuracy, that it can handle many variables, and that it can show the importance of the different predictor variables. (Kho 2018; o.A. 2021)
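The variance-reduction argument can be illustrated with a small base-R simulation (a generic sketch, not tied to the butterfly data): averaging 100 independent noisy estimates shrinks the variance by roughly a factor of 100.

```r
set.seed(1)
# variance of single noisy estimates (standard deviation 2, so variance 4) ...
single_estimates <- rnorm(10000, mean = 0, sd = 2)
# ... versus the variance of averages over 100 such estimates each
averaged_estimates <- replicate(10000, mean(rnorm(100, mean = 0, sd = 2)))

var(single_estimates)    # close to 4
var(averaged_estimates)  # close to 4 / 100 = 0.04
```

In a real forest the trees are not fully independent (they share data and predictors), so the reduction is smaller; that is why the variance ends up moderate rather than near zero.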





2. Overview


2.1 Aim of this work


We want to create a species richness map of different butterfly species in Pakistan. For this we will use the Random Forest machine learning method. The presence data is already given, and as predictors we will use solar radiation (kJ m-2 day-1), precipitation (mm), water vapor pressure (kPa), wind speed (m s-1) and minimum, maximum and average temperature (°C). Butterflies are highly dependent on climate conditions, which is one reason why they are often used as indicators for climate change. Of course, the climatic requirements vary greatly from species to species; even the month-to-month climate is important. Other factors matter as well, because butterflies often rely on their respective host plants. In our model we will focus only on these climatic data, as machine learning models tend to overfit if you give them too many variables. Annual precipitation (BIO12) and average (BIO1), minimum (BIO6) and maximum (BIO5) temperature all come from the Bioclim data from WorldClim, while wind speed, solar radiation and water vapor pressure come from the historical climate data from WorldClim. The spatial resolution is 30 seconds. (Saadat et al. 2016, p. 1963 f.)

Additionally, I wanted to implement the Random Forest method without using the raster or sp package. These very widely used R packages will no longer be supported by the end of 2023, and many models can't be used after that.



2.2 Loading the required packages


First of all, we have to load the packages. We are using rnaturalearth for the shape of Pakistan; terra, sf and dplyr for data management; ggplot2 for the mapping / plotting; and last but not least the package randomForest for the creation of the random forests themselves. Later, ROCR will be needed in order to calculate the area under the curve (AUC), a measure for the performance of the model, while the package DT is used to plot interactive data frames.
If you haven't installed the packages yet, you have to type in install.packages("package name") and press enter. Only after that can you use the library() command to load the packages. In this step I'm also defining the path to the working directory so I can use relative paths, which makes it easy for other people to recreate my script with the same data.

#loading needed packages
library("rnaturalearth")
library("terra")
library("sf")
library("ggplot2")
library("randomForest")
library("dplyr")
library("ROCR")
library("DT")

# defining the working directory so relative paths can be used
wd <- "D:/Studium/Species_modelling/bsc-species-distribution-modelling-2022-LeonUebe/Student_tutorial"
setwd(wd)




3. Data preparation


3.1 Loading the Data


We first get our shape of Pakistan from the rnaturalearth package as a simple features object (sf object). Then we load our existing observation data of the butterflies from a .csv file and the climate data from WorldClim. The WorldClim data includes solar radiation (kJ m^(-2) day^(-1)), precipitation (mm), water vapor pressure (kPa), wind speed (m s^(-1)) and minimum, maximum and average temperature (°C). It can be found at https://worldclim.org/data/worldclim21.html. (I won't run all the commands for the preparation of the WorldClim data here, as this would take too long; in the end I created a file with the shown code and just import that.)

# Loading the shape of Pakistan
shape_Pakistan <- ne_countries(scale = 50, type = "countries", country = "Pakistan", returnclass = "sf")

# Loading the observation data. Attention! Before loading a .csv you should look at it in a text editor to determine the separator and the decimal mark
butterfly_pakistan <- read.table(file.path(wd, "Data/PakistanLadakh.csv"), header = TRUE, sep = ",", dec = ".") 

# Loading the prediction values from Bioclim (for the example we will use 2.5 minutes but the later used data was generated with 30 seconds data)
prediction_value_tempavg <- rast(file.path(wd, "Data/Worldclim/wc2.1_2.5m_bio_1.tif"))
# prediction_value_tempmax <- rast(file.path(wd, "Data/Worldclim/wc2.1_30s_bio_5.tif"))
# prediction_value_tempmin <- rast(file.path(wd, "Data/Worldclim/wc2.1_30s_bio_6.tif"))
# prediction_value_prec <- rast(file.path(wd, "Data/Worldclim/wc2.1_30s_bio_12.tif"))

# the historical WorldClim data gives us, for each variable, the values for all twelve months; we will read them into one SpatRaster and later use the tapp() command to get the average over the year. In this example I show it with the 2.5 minutes data instead of the 30 seconds data, as knitting would otherwise take too long; I will load the prepared 30 seconds data later on

# building the twelve monthly file names and reading them into one SpatRaster
vapr_files <- file.path(wd, sprintf("Data/Worldclim/Vapr/wc2.1_2.5m_vapr_%02d.tif", 1:12))
prediction_value_vapr_raw <- rast(vapr_files)

# using tapp() to calculate the mean for the year in every place 
prediction_value_vapr <- tapp(prediction_value_vapr_raw, index = "numeric", fun = mean)


plot(prediction_value_vapr)


You can now see a single map with the mean water vapor pressure. I did the same for srad and wind, but hid the code so the tutorial doesn't get too overcrowded.




Let's look at the data so far:

# the observation points combined with the shape of Pakistan
ggplot() + geom_sf(fill = "white", data = shape_Pakistan$geometry) + geom_point(data = butterfly_pakistan, size = 1, aes(x = x, y = y)) +  
labs(x = "Longitude", y = "Latitude", title = "Observation Points of Butterflies Raw")


And there is our first problem: as you can see, not all observations of butterflies lie within the borders of Pakistan. Also, the prediction values are still separate layers and have the extent of the whole world. This leads us to our next step: data preparation.



3.2 Changing the project's coordinate system


Of course we can't simply keep the default coordinate system WGS 84 resp. EPSG:4326. In this step we will transform the coordinate system of the shape of Pakistan and the prediction values from EPSG:4326 (area of use: whole world) to EPSG:4145 (Kalianpur 1962), a coordinate system used for Pakistan. (The Proj.4 code can be found at https://epsg.io/4145.)


# defining the coordinate system (EPSG:4145, Kalianpur 1962)
crs <- "+proj=longlat +a=6377301.243 +b=6356100.230165384 +towgs84=283,682,231,0,0,0,0 +no_defs"

# using the project() command from terra for Raster data
prediction_value_tempavg <- project(prediction_value_tempavg, crs)
# prediction_value_tempmax <- project(prediction_value_tempmax, crs)
# prediction_value_tempmin <- project(prediction_value_tempmin, crs)
# prediction_value_prec    <- project(prediction_value_prec, crs)
# prediction_value_vapr    <- project(prediction_value_vapr, crs)
# prediction_value_srad    <- project(prediction_value_srad, crs)
# prediction_value_wind    <- project(prediction_value_wind, crs)


# using the st_transform() command from the sf package for sf data 
shape_Pakistan <- st_transform(shape_Pakistan, st_crs(crs))


Nice, now we have changed the coordinate system to a better-fitting one.

3.3 Masking the Climatic Data

Up until now, the prediction values cover the whole world. Of course, we only need the data for Pakistan. We will use the mask() function from terra for this and cut the raster to the shape of shape_Pakistan.

# first we have to change the sf object to a SpatVector object in order to use it to cut out the prediction values
vect_Pakistan <- vect(shape_Pakistan)

# masking the prediction values
prediction_value_tempavg <- mask(prediction_value_tempavg, vect_Pakistan)

# cropping to the area of interest, as the extent of the world map remained after masking and we only want the extent of Pakistan
prediction_value_tempavg <- crop(prediction_value_tempavg, shape_Pakistan)
plot(prediction_value_tempavg)


Here you can finally see the average temperature for Pakistan. You can do the same for all the other prediction variables by exchanging prediction_value_tempavg for the corresponding prediction_value_....

In the end, you combine all of them into one SpatRaster.

# giving a name to each layer so they will show up once they are combined; as mentioned, they were generated in the same way as the example shown for prediction_value_tempavg

names(prediction_value_tempavg) <- "Annual_Mean_Temperature"
names(prediction_value_tempmax) <- "Max_Temperature_of_Warmest_Month" 
names(prediction_value_tempmin) <- "Min_Temperature_of_Coldest_Month"
names(prediction_value_prec) <- "Annual_Precipitation"
names(prediction_value_vapr) <- "Mean_water_vapor_pressure"
names(prediction_value_srad) <- "Mean_solar_radiation"
names(prediction_value_wind) <- "Mean_wind_speed"

# combining the prediction values

predictions_values <- c(prediction_value_tempavg,
                        prediction_value_tempmax,
                        prediction_value_tempmin,
                        prediction_value_prec,
                        prediction_value_vapr,
                        prediction_value_srad,
                        prediction_value_wind)

# write out the raster so we can reload it and don't have to run all the code when knitting
writeRaster(predictions_values, file.path(wd, "Data/Worldclim/predictions_values.tif"))
predictions_values <- rast(file.path(wd, "Data/Worldclim/predictions_values.tif"))
plot(predictions_values)

As you can see, the values of the predictors are in a realistic range, so we will stick with them.

3.4 Masking the butterfly observation points


First, we will clip the data frame of the observations with the borders of Pakistan. We have to define a coordinate system for the points, and for this we use the coordinate system of the Pakistan data. Then we transform butterfly_pakistan into an sf point object so the sf package can use the data. After this, we use the st_intersection() function to cut out all points outside of Pakistan.

# transforming the data frame to a sf object
sf_butterfly <- st_as_sf(butterfly_pakistan, coords = c("x", "y"), crs = crs) 

# cutting out the points outside of Pakistan
masked_butterfly <- st_intersection(sf_butterfly, shape_Pakistan) 

# returning the sf object to a data frame
df_bf <- as.data.frame(masked_butterfly) 

# getting the coordinates for the data frame, as the argument "xy = TRUE" in as.data.frame() didn't work here
coordinates_butterfly <- sf::st_coordinates(masked_butterfly)

# transform the matrix with the coordinates to a data frame
coords_df <- as.data.frame(coordinates_butterfly) 
butterfly <- cbind(df_bf$ï..species, coords_df) # combine the cut-out observation points with the coordinates (the odd column name "ï..species" stems from a byte order mark at the start of the .csv file)

# Plot the cut out observation points
ggplot() + geom_sf(fill = "white", data = shape_Pakistan$geometry) + geom_point(data = butterfly, size = 1, aes(x = X, y = Y)) + 
labs(x = "Longitude", y = "Latitude", title = "Cut out observation points ")

3.5 Preparing the data for the actual test


Now that we have nice clean data, we will start with one species in order to show how to create a random forest. Because Random Forests work with presence and absence data, we have to add absence points. We don't have these in the original survey, so we have to use a trick: background absence points. These points don't try to guess where the species is not, but characterize the environment, so the resulting absence points can also appear where the species is found. The presence points should then show in which areas and under which conditions the species is more likely than average to appear. According to a study by Barbet-Massin et al., the best ratio of presence to absence points for Random Forests is 1 (presence) to 10 (absence). As mentioned, this can lead to a bad performance of the model, but we will return to that later. (Hijmans & Elith o.d., Absence Points; Barbet-Massin et al. 2012, p. 328 ff.)


We will use the species Colias erate as an example, because with 70 observation points it has the most presence points in the data set. We start by extracting the values for the presence points from the prediction raster using the extract() command. This gives us a data frame with the prediction values for the points where Colias erate was seen.

# using extract for the presence prediction values 
presence_points <- terra::extract(predictions_values, butterfly[butterfly$`df_bf$ï..species` == "Colias_erate",2:3])

# Using the DT package with its datatable() function to plot an interactive data frame, which shows the ID column we don't want ...
datatable(presence_points)
# ... so we remove it
presence_points_minus_ID <- subset(presence_points, select = -ID)



For the absence points, we first set a seed in order to get reproducible results. Then we use the spatSample() command to get random absence points within the borders of Pakistan. We want a ratio of 1 to 10, so in the function we enter sum(butterfly species) * 10 (this will be 700 for the start) as the number of points to generate.

# setting the seed and generate the absence points
set.seed(0308)
absence_points <- spatSample(predictions_values, sum(butterfly$`df_bf$ï..species` == "Colias_erate")*10, "random", na.rm=TRUE, as.df = TRUE)

# as you can see we don't have to use extract as the data is already in the generated Pseudo-absence Points
datatable(absence_points)



Now we will replace presence with Value 1 and absence with Value 0 and then combine both presence and absence data into one data frame.

# adding column that says 1 for presence data and 0 for absence data using the rep() (replicate function)
pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points))) 

# combining both
pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID, absence_points)))

datatable(pres_abs_data)




As already mentioned, there are Random Forests for classification and regression. To keep it simple: regression is used to predict the values of a dependent variable from one or more independent variables. For example, we could use it for income, or for the probability of the species appearing in a certain area given the climate there. In contrast, classification is used to map discrete output variables from input variables. In our example we want a prediction of whether the species is there or not, so we have to use classification. For this reason, we have to tell R that pres_abs is a factor, otherwise it would use regression. (Gupta 2021)

pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])


For this example species, we will now split the data into a training and a testing set so we can test the model's performance later on. Normally, with big data sets, you would use a ratio of 80-90% (training) to 20-10% (testing), but because our sample is very small, we will use a ratio of 60% training to 40% testing data. You can use the sample.int() function from base R for this. The function takes a specified number of elements (60% here) as a sample from our presence-absence data. We use replace = FALSE so the testing and training data won't overlap.

# setting a seed again as the samples are randomly divided and create a sample
set.seed(0308)
sample <- sample.int(n = nrow(pres_abs_data), size = floor(0.6*nrow(pres_abs_data)), replace = FALSE) #because there are species with a very small sample size we will use a ratio of 60% training data to 40% testing data

# Using the 60% of the sample for the training set…
train <- pres_abs_data[sample, ]

# … and the exact opposite data for the test set
test <- pres_abs_data[-sample, ]

With this step we are finished with the Data preparation (at least for the classical classification tree).



4. Random Forest Model generation


4.1 Creating our first Random Forest


For this step we will finally use the randomForest package. First, we have to set a seed again so other people can recreate this.
pres_abs ~ . is the formula telling R that it should classify the presence and absence (our pres_abs) using the climate data also stored in the variables train, test and pres_abs_data. replace = TRUE means each tree randomly selects its data with replacement and won't run out of it. For NA values the program should use the column averages (na.roughfix), and the importance of the predictors should be assessed. To start, we will generate 300 trees and allow R to use three variables (mtry = 3), randomly sampled as candidates at each split. Later on, we will optimize the forest.

# setting the seed and run the forest!
set.seed(0906)
prediction_model <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 300, na.action = na.roughfix, do.trace = FALSE)


Picture 6 OOB estimate of the first default Classification Tree


As you can see, the program produced a Random Forest with 300 trees and tried 3 variables at each split, just as we wanted. We also see the out-of-bag error for the first time; as mentioned, this error is an indicator of how well a model performs. In the confusion matrix, the top-left entry is the data correctly classified as absence data, the bottom-right entry is the correctly classified presence data, and the top right is the misclassified absence data (only 3.89%). However, 71% of the presence data was misclassified. Before we jump to conclusions, we will test our model with our testing data.

# comparing the data from test with the predictions of the model
table_test <- table(test$pres_abs, predict(prediction_model, newdata=test))
table_test # top left is correctly classified absence data, top right absence data falsely classified as presence data, bottom left presence data falsely classified as absence data and bottom right is correctly classified presence data 
##    
##       0   1
##   0 269  11
##   1  20   8
# error 0; % of falsely classified absence data
table_test[3]/(table_test[1] + table_test[3]) 
## [1] 0.03928571
#error 1 % of falsely classified presence data
table_test[2]/(table_test[2] + table_test[4])
## [1] 0.7142857

This already looks better, but still not perfect. I already wrote that Random Forests can perform really badly with class imbalances, so let's try the whole process again. Before we continue improving the model, we can take a peek at how the distribution map could look later.

# we can use the predict() function to predict whether the species would appear in parts of Pakistan based on the climatic data there
prediction <- predict(predictions_values, prediction_model)

# because Random Forest changes our 1 for presence data to a 2 and the 0 for absence data to a 1 we will just change it back
prediction[prediction == 1] <- 0
prediction[prediction == 2] <- 1

# plot(prediction)

Picture 7 Predicted Distribution of Colias erate with a classical Random Forest (ntree = 300, mtry = 3 and Ratio Presence to Absence is 1:10); 1 means Species is predicted to be present and 0 means Species is predicted to be absent



With varImpPlot() we can look at our predictors and their importance. As you can see, solar radiation is the most important variable for this species.

# The function varImpPlot can show us the importance of certain predictors for the model 
# randomForest::varImpPlot(prediction_model)


Picture 8 Importance of the different predictors



We will now change the generation of the absence points to get a ratio of presence to absence points of 1 to 1. We do this to see whether the first model performed so badly because of the imbalance between the classes:

# Let's just remove the 0, i.e. change the multiplier from *10 to *1, ...
set.seed(0308)
absence_points <- spatSample(predictions_values, sum(butterfly$`df_bf$ï..species` == "Colias_erate")*1, "random", na.rm=TRUE, as.df = TRUE)
# ... and then we just rerun the whole process.
pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points))) 
pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID, absence_points)))
pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])
set.seed(0308)
sample <- sample.int(n = nrow(pres_abs_data), size = floor(0.6*nrow(pres_abs_data)), replace = F)
train <- pres_abs_data[sample, ]
test <- pres_abs_data[-sample, ]
set.seed(0906)
prediction_model_1to1 <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 300, na.action = na.roughfix, do.trace = FALSE)


Picture 9 Estimated OOB-Error for the classical Random Forest with the Ratio of 1 to 1

table_test <- table(test$pres_abs, predict(prediction_model_1to1, newdata=test))
table_test
##    
##      0  1
##   0 18  4
##   1  7 27

You can quickly see that the misclassification of presence points is much lower in the new model. Of course, the absence error rate got bigger, but this is because the randomly sampled absence points can land in areas which are in reality suitable for Colias erate or other species. In practice we run into another problem here: the data set has 429 different butterfly species, 210 of which have fewer than 10 observation points. With a ratio of 1:1 it is therefore impossible for the model to see most of the climatic range.

Let’s look at both maps compared to each other:

prediction_1to1 <- predict(predictions_values, prediction_model_1to1)
prediction_1to1 [prediction_1to1  == 1] <- 0
prediction_1to1 [prediction_1to1  == 2] <- 1
# plot(prediction, xlab = "Longitude", ylab = "Latitude")
# plot(prediction_1to1, xlab = "Longitude", ylab = "Latitude")
# to be honest, normally you would add the axis titles with the xlab and ylab arguments. In my case this sadly didn't work, so I added them manually.

Picture 7 (again) Predicted Distribution of Colias erate with a classical Random Forest (ntree = 300, mtry = 3 and Ratio Presence Absence 1:10)

Picture 10 Predicted Distribution of Colias erate with a classical Random Forest (ntree = 300, mtry = 3 and Ratio Presence Absence 1:1)



The area the model predicts as suitable for the species greatly increased. For this particular species we might be able to simply reduce the number of pseudo-absence points, but as mentioned, there are species with very few observations in the data set, and normally you want enough absence points to represent all environments in the region. Class imbalance is a big problem here because it leads to unequal representation of the classes, and class overlap means that the (randomly sampled) absence data appears in areas where the species could actually live. This is of course bad for the training of the model, which now thinks that an actually suitable area is unsuitable. (Valavi et al. 2020, p. 1731 ff.)




4.2 Tuning the Class Imbalance


We will now first look at how to solve the class imbalance problem, and then at different methods to generate absence points in order to prevent class overlap. There are quite a few proposed solutions to the class imbalance and class overlap problems, but we will focus only on weighted, regression and down-sampling Random Forests. For further reading I recommend the paper "Modelling species presence-only data with random forests" by Valavi et al. 2020, where they explain how to implement all the proposed solutions in R with two different packages, which I used for my models in the following. The paper can be read online at https://onlinelibrary.wiley.com/doi/epdf/10.1111/ecog.05615

We will compare the methods by their estimated OOB errors and the errors calculated from the prediction of the test data. We cannot use the precision of the model, as this indicator is prone to class imbalance. Additionally, we will calculate the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. This is often used to measure the discrimination of models, i.e. how well they separate the classes. As you can see in picture 11, the ROC curve is built by plotting the true-positive rate against the false-positive rate across the possible classification thresholds. The AUC measures the area under the ROC curve. In theory, a perfect model has an AUC of 1, while a random classifier has an AUC of 0.5. This means the closer our AUC is to 1, the better the model performs. (Valavi et al. 2020, p. 1737 f.)


Picture 11 What is the AUC (Source https://commons.wikimedia.org/wiki/File:Roc_curve.svg)



Calculating the AUC
For the code calculating the AUC I used code published in a thread on Stack Exchange, which can be found at https://stats.stackexchange.com/questions/308645/why-does-randomforest-has-higher-test-auc-than-train-auc-is-this-possible

# the argument type = "prob" gives us the probability of presence and, correspondingly, of absence for the species; because we only want the presence probability, we take column 2 directly
prob_test <- predict(prediction_model, type="prob",newdata = test)[,2]

# here we use the prediction function of the ROCR package which will give us a ROCR prediction function
pred_rocr_test <- prediction(prob_test, test$pres_abs)

# for this package we have to use @ to access the slots of the S4 object, that's why we can access the AUC value with @y.values
auc_test <- performance(pred_rocr_test, measure = "auc")@y.values[[1]] 
auc_test
## [1] 0.9258021

This AUC is not too bad, but of course we can do far better than this. Remember that "correctly" classified pseudo-absence data can give us a headache in this situation, as such points are also counted as correct classifications.
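For intuition, the AUC that ROCR computes can also be obtained by hand in base R via the rank-sum (Mann-Whitney) formulation: it is the probability that a randomly chosen presence point receives a higher score than a randomly chosen absence point. The helper below is a hypothetical sketch meant only to illustrate what the number means.

```r
# AUC via the rank-sum formulation: rank the presence scores among all
# scores, subtract the smallest possible rank sum, and divide by the
# number of presence/absence pairs
auc_by_hand <- function(scores, labels) {
  r <- rank(scores)  # ties get average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# a perfectly separating model scores every presence above every absence
auc_by_hand(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # 1
# a model that gets half of the presence/absence pairs in the wrong order
auc_by_hand(c(0.9, 0.2, 0.4, 0.6), c(1, 1, 0, 0))  # 0.5
```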


In the following, we will use the ratio of 1:10 for presence to absence data, as we want to capture all the climatic areas in Pakistan. Also, we will use 1000 trees per forest to be sure that no further improvement can be reached at that point. First, we have to run the generation of the absence points again to get the ratio of 1:10.


set.seed(0308)
absence_points <- spatSample(predictions_values, sum(butterfly$`df_bf$ï..species` == "Colias_erate")*10, "random", na.rm=TRUE, as.df = TRUE)

presence_points_minus_ID <- subset(presence_points, select = -ID) # because we used a data frame here to extract the data for the coordinates, an ID column was created. We have to remove this column so we can later combine the absence and presence data into one data frame. 

pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points))) # adding column that says 1 for presence data and 0 for absence data
pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID, absence_points))) # combining both

pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])

set.seed(0308)
sample <- sample.int(n = nrow(pres_abs_data), size = floor(0.6*nrow(pres_abs_data)), replace = FALSE) #because there are species with a very small sample size we will use a ratio of 60% training data to 40% testing data
train <- pres_abs_data[sample, ]
test <- pres_abs_data[-sample, ]




4.2.1 Redoing the Classical Model


First, we have of course our standard classification model. It actually deteriorated slightly: we increased the number of trees, and in the early stages of growing a forest there is a big fluctuation in model performance (we will tune this later on).

set.seed(0906)
prediction_model <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE)
prediction_model
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = train, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.39%
## Confusion matrix:
##     0  1 class.error
## 0 405 15  0.03571429
## 1  33  9  0.78571429
table_test_classic <- table(test$pres_abs, predict(prediction_model, newdata=test))
table_test_classic
##    
##       0   1
##   0 269  11
##   1  20   8
# error 0; % of falsely classified absence data
table_test_classic[3]/(table_test_classic[1] + table_test_classic[3])
## [1] 0.03928571
#error 1; % of falsely classified presence data
table_test_classic[2]/(table_test_classic[2] + table_test_classic[4])
## [1] 0.7142857
# calculating the AUC
prob_test <- predict(prediction_model, type="prob",newdata = test)[,2]
pred_rocr_test <- prediction(prob_test, test$pres_abs)
auc_test <- performance(pred_rocr_test, measure = "auc")@y.values[[1]] 
auc_test
## [1] 0.8748724
# creating a data frame to later compare the different methods: we will collect the method, the estimates for the errors and the error rates for the test data set. 
RandomForestmethods <- data.frame("method" = "Classical",
                                  "estimate OOB" = prediction_model$err.rate[1000, 1],
                                  "estimate Error 0" = prediction_model$err.rate[1000, 2],
                                  "estimate Error 1" = prediction_model$err.rate[1000, 3],
                                  "Error 0 for test" = table_test_classic[3]/(table_test_classic[1] + table_test_classic[3]),
                                  "Error 1 for test" = table_test_classic[2]/(table_test_classic[2] + table_test_classic[4]),
                                  "AUC" = auc_test)

# For the following methods the data frame is created in the exact same way and then added to this existing data frame by using rbind()




4.2.2 Weighted Method


The first alternative method we are going to try is the Weighted method. As the name suggests, it works by applying weights to the different classes, increasing the so-called cost of misclassification for the minority class. In randomForest the weights are applied to the Gini index, which impacts the selection of splits and the importance of different nodes in the tree. (Valavi et al. 2020, p. 1735)
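To make the idea more concrete, here is a small sketch (my own illustration, not randomForest's internal code) of how class weights change the Gini impurity of a node:

```r
# weighted Gini impurity for one node: the weighted class proportions
# replace the raw proportions in 1 - sum(p^2)
weighted_gini <- function(counts, weights) {
  w_counts <- counts * weights    # weight each class's count
  p <- w_counts / sum(w_counts)   # weighted class proportions
  1 - sum(p^2)
}
weighted_gini(c(40, 4), c(1, 1))     # unweighted: the node looks quite pure
## [1] 0.1652893
weighted_gini(c(40, 4), c(1.1, 11))  # with our weights the impurity rises
## [1] 0.5
```

With the weights, a node dominated by absences no longer looks pure, so splits that isolate the presence points pay off much more during tree growing.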

I will only show the important steps as all the steps before are identical between all of these methods.

# calculating the weights for the classes: prop.table() gives us the proportions of presence and absence data in the train data; 1 is then divided by these proportions and we get the weights for the classes (1.1 for absence and 11 for presence data here)

cwt <- 1 / prop.table(table(train$pres_abs))
# the weights for the classes
cwt
## 
##    0    1 
##  1.1 11.0
# the only difference to the classical model is the new argument classwt, which tells randomForest the weights for the different classes using our just calculated cwt
set.seed(0906)
prediction_model_weighted <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE, classwt = cwt)
prediction_model_weighted
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = train, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      classwt = cwt, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.96%
## Confusion matrix:
##     0  1 class.error
## 0 406 14  0.03333333
## 1  32 10  0.76190476
table_test <- table(test$pres_abs, predict(prediction_model_weighted, newdata=test))
table_test
##    
##       0   1
##   0 272   8
##   1  18  10
# error 0; % of falsely classified absence data
table_test[3]/(table_test[1] + table_test[3])
## [1] 0.02857143
#error 1; % of falsely classified presence data
table_test[2]/(table_test[2] + table_test[4])
## [1] 0.6428571
# calculating the AUC
prob_test <- predict(prediction_model_weighted, type="prob",newdata = test)[,2]
pred_rocr_test <- prediction(prob_test, test$pres_abs)
auc_test <- performance(pred_rocr_test, measure = "auc")@y.values[[1]] 
auc_test
## [1] 0.8806122



Well, we can already see that the Weighted model in our case performs nearly as badly as the original model, so we have to continue searching for a better model.




4.2.3 Using Regression Random Forests


According to Valavi et al. (2020, p. 1736), even though normally Classification Random Forests (Random Forests consisting of classification trees) are used for modelling binary data, Regression Random Forests (consisting of regression trees) can be better at predicting the class probabilities. For Regression Random Forests we can't use the OOB error rate as an indicator; instead, the mean of squared residuals and the percentage of variance explained by the OOB predictions are reported.
To use a Regression Random Forest we can use the exact same code as for the classical one, because the randomForest() function decides between regression and classification based on the data type it has to process. We just have to change the pres_abs column of the data back to a numeric value.

# transforming it back to a numeric value
train$pres_abs <- as.numeric(as.character(train$pres_abs))


set.seed(0906)
prediction_model_regression <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE)
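As a side note, the regression forest stores its OOB-based diagnostics in the fitted object; a quick way to look at them (using the $mse and $rsq fields of a randomForest regression object) is:

```r
# OOB diagnostics of the regression forest; both are vectors with one entry
# per tree, so the last entry reflects the full forest
tail(prediction_model_regression$mse, 1)  # mean of squared residuals
tail(prediction_model_regression$rsq, 1)  # fraction of variance explained
```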



If we now predict with this model, for example on the test data set, it gives us the probability of the species appearing there. In order to get binary values we have to tell R at which point it should call a prediction presence or absence. This is a so-called threshold. To start, we will tell R to treat points with a probability of 50% or more as presence points and points with less than 50% as absence points.

 # getting the prediction probability for appearance in a certain place
thresholdsetting <- predict(prediction_model_regression, newdata=test)

# changing every entry with a probability of 50% or more to 1 (presence) and everything below 50% to 0 (absence)
thresholdsetting[thresholdsetting >= 0.5] <- 1
thresholdsetting[thresholdsetting < 0.5] <- 0

table_test_regression <- table(test$pres_abs, thresholdsetting)
table_test_regression # Test Table for the Regression method
##    thresholdsetting
##       0   1
##   0 270  10
##   1  19   9
table_test_classic # Table test for our classical classification Random Forest
##    
##       0   1
##   0 269  11
##   1  20   8
prob_test <- predict(prediction_model_regression,newdata = test)
pred_rocr_test <- prediction(prob_test, test$pres_abs)
auc_test <- performance(pred_rocr_test, measure = "auc")@y.values[[1]] 
auc_test
## [1] 0.8826531


By accident we just learned something about the random forest algorithm: by setting the threshold to 50% we get nearly exactly the same result as the Classification Random Forest. Many packages let you fit a Regression Random Forest and determine the threshold afterwards. In the following we will continue to use the Classification Random Forest, but keep in mind that you can change this threshold on your own at any time; for the rest of this tutorial the threshold stays at the default 50%. You can also see when calculating the AUC that even without setting a threshold (predicting the test data with the regression model directly), the Classical and Regression methods perform nearly equally.
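If you want to experiment with other thresholds, a compact sketch (my own helper, assuming a vector of predicted probabilities `prob` and 0/1 labels `truth`) could look like this:

```r
classify_at <- function(prob, threshold) as.integer(prob >= threshold)

# hypothetical probabilities and labels, just to show the trade-off
prob  <- c(0.10, 0.35, 0.55, 0.80)
truth <- c(0, 0, 1, 1)

# overall accuracy at three different thresholds
sapply(c(0.3, 0.5, 0.7), function(t) mean(classify_at(prob, t) == truth))
## [1] 0.75 1.00 0.75
```

A low threshold misclassifies absences, a high one misclassifies presences; the 50% default is simply the middle ground.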


So let’s try our fourth method: Down-Sampling

4.2.4 Down Sampling


Down sampling, sometimes called balanced random forest, uses for each tree a predefined ratio (in our case 1:1) of presence and absence points. This method randomly samples from the generated absence points for each tree, contrary to generating exactly as many absence points as presence points, as I showed you with the classical model. This also means that, across hundreds of trees, many more different absence points throughout Pakistan are used, and the model can better understand the climatic areas of the whole country. (Valavi et al. 2020, p. 1736)

#return the training data back to factor from numeric
train$pres_abs <- as.factor(as.character(train$pres_abs))

# getting the number of presence records
prNum <- as.numeric(table(train$pres_abs)["1"]) 

# sample size for both classes
spsize <- c("0" = prNum, "1" = prNum) 
spsize # here you can see each class is supposed to be able to use 42 points in each tree 
##  0  1 
## 42 42


With spsize, randomForest is now told to use 42 presence points (all that we have) and at most 42 absence points for each tree. We only have to add one argument to our first Random Forest call, telling it to use spsize as the sample size.


set.seed(0906)
# saying R it should use spsize as the sample size for the different classes
prediction_model_downsampling <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE, sampsize = spsize)

table_test <- table(test$pres_abs, predict(prediction_model_downsampling, newdata=test))
table_test
##    
##       0   1
##   0 238  42
##   1   6  22
# error 0
table_test[3]/(table_test[1] + table_test[3])
## [1] 0.15
#error 1
table_test[2]/(table_test[2] + table_test[4])
## [1] 0.2142857
# calculating the AUC
prob_test <- predict(prediction_model_downsampling, type="prob",newdata = test)[,2]
pred_rocr_test <- prediction(prob_test, test$pres_abs)
auc_test <- performance(pred_rocr_test, measure = "auc")@y.values[[1]] 
auc_test
## [1] 0.8954719
# Using the DT package with its datatable function to plot a interactive data frame
datatable(RandomForestmethods)


The Down Sampling method performed best in AUC and, even more importantly, in Error 1. All the other models had big problems identifying presence points. To be fair, the classical model with a ratio of 1:1 of presence and absence points performs similarly, but later there will be many species with far fewer data points than Colias erate, and for those we want to give the Random Forest a better overview of the climate of Pakistan.
Let's have an overview of all the maps.

# first we predict all the different methods on the predictors, then we again have to change the numbers, and lastly we give them a name so we can distinguish between them later

prediction <- predict(predictions_values, prediction_model)
prediction [prediction  == 1] <- 0
prediction [prediction  == 2] <- 1
names(prediction) <- "Classification Method"

prediction_weighted <-predict(predictions_values, prediction_model_weighted)
prediction_weighted [prediction_weighted  == 1] <- 0
prediction_weighted [prediction_weighted  == 2] <- 1
names(prediction_weighted) <- "Weighted Method"

# for the regression method we don't have to change the values, as R gives us the probability for the species to exist there

prediction_regression <-predict(predictions_values, prediction_model_regression)
names(prediction_regression) <- "Regression Method"

prediction_downsampling <-predict(predictions_values, prediction_model_downsampling)
prediction_downsampling [prediction_downsampling  == 1] <- 0
prediction_downsampling [prediction_downsampling  == 2] <- 1
names(prediction_downsampling) <- "Down Sampling Method"

prediction_methods <- prediction
prediction_methods <- terra::`add<-`(prediction_methods, prediction_weighted)
prediction_methods <- terra::`add<-`(prediction_methods, prediction_regression)
prediction_methods <- terra::`add<-`(prediction_methods, prediction_downsampling)

plot(prediction_methods)

Methods of generating Random Forests, x axis = Longitude, y axis = Latitude


Valavi et al. showed similar results using these methods. The Classical, Weighted and Regression methods performed similarly, while the best results were generated by the Down Sampling method and Shallow (tuned) Trees. To keep it simple, Shallow Trees work like pruned decision trees: you stop the trees from growing too big and give them a splitting criterion that is robust to imbalanced data. Thereby you can prevent class overlap with a maximum tree size and reduce the effects of the imbalanced data with the splitting criterion. (Valavi et al. 2020, p. 1736)
I didn't implement this method in this tutorial, as the ranger package (also a Random Forest package) is used for it and I wanted to stick to one Random Forest package. But if you are interested in the implementation you can read about it in the Supporting Information of Valavi et al..
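For completeness, a minimal sketch of what such a shallow-tree forest might look like with ranger (untested here; max.depth is a documented ranger argument, and the exact settings used by Valavi et al. can be found in their Supporting Information):

```r
library("ranger")

# a forest of deliberately shallow trees; max.depth limits tree size
shallow_model <- ranger(pres_abs ~ ., data = train,
                        num.trees = 1000,
                        max.depth = 4,        # keep the trees shallow
                        probability = TRUE)   # return class probabilities
```

Note that ranger, unlike randomForest with na.roughfix, does not impute missing values for you, so the training data would need to be complete.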



Picture 12 The spatial prediction of different RF implementations on a virtual species by Valavi et al.. The ‘r’ shows the correlation between the virtual species (Truth) and the models’ predictions. For ease of comparison, the likelihoods of all the maps were linearly rescaled between 0 and 1, so if there are a few high predictions and the rest is low, the map will be pale (i.e. overfitting). (Valavi et al. 2020, p. 1739)


As you can see in picture 12, the Down Sampling method performs best at predicting the actual range of the (virtual) species in the study by Valavi et al.. In our tests the Down Sampling method also performed best, so we will stick to it. Now that we have chosen a method, we will tune the generation process of the absence points.




4.3 Tuning the (Pseudo-)Absence points generation


There are many reasons why a species doesn’t appear everywhere it could according to its environmental needs: the species may not have reached the place yet because of barriers or competition, or it appears there but hasn’t been detected yet. So just randomly generating the pseudo-absence data will of course lead to class overlap, and now we will look at alternative methods to generate absence points if you don’t have them. There are essentially three different methods of generating pseudo-absences. (Senay et al. 2013, p. 1f.)
1. Randomly generating the absence points, as we did so far. It is a very easy method but will certainly lead to class overlap.
2. Spatially constrained generation: the basic idea of this method is that the climatic values around the observation points are similar to the preferred values of the species. This means you only generate random absence points outside a certain distance of the observation points. (Senay et al. 2013, p. 2)
3. Generating pseudo-absence points based on environmental variables: you first use a presence-only prediction method like BIOCLIM and then only generate the pseudo-absences in the areas predicted as unsuitable for the species. Models that generate pseudo-absences this way are so-called two-step models. (Senay et al. 2013, p. 2)


4.3.1 Spatially constrained generation


To model the spatially constrained method, I first generated a buffer around all observation points with a radius of 20 kilometers.


set.seed(0308)
absence_points <- spatSample(predictions_values, sum(butterfly$`df_bf$ï..species` == "Colias_erate")*10, "random", na.rm=TRUE, as.points = TRUE)


# We first have to convert the data frame of butterfly back to a simple feature object (sf)
sf_butterfly <- st_as_sf(butterfly, coords = c("X", "Y"), crs = crs)

# converting the absence points to an sf object
sf_absence_points <- st_as_sf(absence_points, crs = crs)

# then we can use the buffer function of the sf package in order to create the circles around the butterfly observation points
butterfly_buffer <- st_buffer(sf_butterfly, 20000)

# here you can see our example species and the absence points generated for it 
ggplot() + geom_sf(data = shape_Pakistan, fill = "white") +
geom_sf(data = sf_absence_points) +
geom_sf(data = butterfly_buffer$geometry[butterfly_buffer$`df_bf$ï..species` == "Colias_erate"], fill = "transparent", size = 1.05) +
labs(x = "Longitude", y = "Latitude", title = "Absence points and the buffers around Observation points of Colias erate")

There are many pseudo-absence points in the vicinity of our presence points. These will cause class overlap and have a negative influence on our model's performance.

plot(predictions_values$Annual_Mean_Temperature)
plot(butterfly_buffer$geometry[butterfly_buffer$`df_bf$ï..species` == "Colias_erate"], add = TRUE)
plot(absence_points, add = TRUE)


Here you can see the annual mean temperature together with our buffers. Of course the temperature is very heterogeneous in the Himalaya in this region, and our relatively big buffers will reduce the ability of the model to understand the climatic zone there, but there are still enough pseudo-absence points outside the buffers to work with.


# using st_intersection() to collect all absence points within the buffer zones for our selected species
Index <- st_intersection(sf_absence_points,
                         butterfly_buffer$geometry[butterfly_buffer$`df_bf$ï..species` 
                                                   == "Colias_erate"])

ggplot() + geom_sf(data = shape_Pakistan, fill = "white") +
geom_sf(data = sf_absence_points) +
geom_sf(data = butterfly_buffer$geometry[butterfly_buffer$`df_bf$ï..species` == "Colias_erate"], fill = "#ff5047") + 
geom_sf(data = Index, size = 2, colour = "#89db7b") + 
labs(x = "Longitude", y = "Latitude", title = "Absence points in the buffer")


Now we have selected all points in the buffer zones.


# converting the absence points back to a data frame
df_absence_points <- as.data.frame(sf_absence_points, xy = TRUE)

# converting the selected absence points within the buffer zone to a data frame
df_Index <- as.data.frame(Index, xy = TRUE)

# now we will use a little trick from the dplyr package: the function anti_join() removes all entries in a data frame that match the values of another data frame - this means we remove all absence points within the buffer zones from our absence points, matching on the geometry column
distanced_absence_points <- anti_join(df_absence_points, df_Index, by = "geometry")


And as you can see, we have now removed all absence points inside the buffer zones (you can still see some right at the border). Altogether 42 pseudo-absence points were removed this way. Now we will run the process of getting our train and test data again and after that try the new data on the Classical and the Down Sampling Random Forest.

presence_points <- terra::extract(predictions_values, butterfly[butterfly$`df_bf$ï..species` == "Colias_erate",2:3])
presence_points_minus_ID <- subset(presence_points, select = -ID) 

# removing the Geometry column which remained from the sf object
absence_points_distance <- subset(distanced_absence_points, select = -geometry)


pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points_distance)))
pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID, absence_points_distance))) # combining both
pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])
set.seed(0308)
sample <- sample.int(n = nrow(pres_abs_data), size = floor(0.6*nrow(pres_abs_data)), replace = FALSE)
train <- pres_abs_data[sample, ]
test <- pres_abs_data[-sample, ]


# now lets use the new data in our models
set.seed(0906)
prediction_model_sc <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE)

set.seed(0906)
prNum <- as.numeric(table(train$pres_abs)["1"]) # number of presence records
spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes
prediction_model_downsampling_sc <- randomForest(pres_abs ~ ., data = pres_abs_data, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE, sampsize = spsize)

prediction_model_sc
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = train, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.09%
## Confusion matrix:
##     0  1 class.error
## 0 372 20  0.05102041
## 1  24 20  0.54545455
table_test_classic <- table(test$pres_abs, predict(prediction_model_sc, newdata= test))
table_test_classic
##    
##       0   1
##   0 254  12
##   1  11  15
prediction_model_downsampling_sc
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = pres_abs_data, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      sampsize = spsize, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.77%
## Confusion matrix:
##     0  1 class.error
## 0 573 85   0.1291793
## 1   8 62   0.1142857
table_test <- table(test$pres_abs, predict(prediction_model_downsampling_sc, newdata=test))
table_test
##    
##       0   1
##   0 235  31
##   1   1  25


Already you can see a huge improvement compared to our earlier models. We now compare the error rates and the AUC with the models before. As I said, an AUC of 1 would be a perfect model, so a model with an AUC of 0.9783112, as in our case, performs very well. Even our classical standard Random Forest improved its Error 1 rate considerably.




4.3.2 Generating Pseudo-absence points based on environmental variables


Up until now I was able to create all the models without using raster or sp (they will be imported by dismo). Sadly, I didn't manage to find a package including BIOCLIM which doesn't require these packages, so just for this single step we have to live with them. Note: in the near future there might be new packages available that work only with terra and sf. I won't cover how BIOCLIM works in this tutorial. For further information you should read the explanation on https://support.bccvl.org.au/support/solutions/articles/6000083201-bioclim.

library("dismo")
library("rgdal")
# converting our observation coordinates into a matrix
pres_data <- as.matrix(butterfly[butterfly$`df_bf$ï..species` == "Colias_erate",2:3])

# converting our prediction values from a Spatraster to a Rasterstack 
predictions_values_2step <- brick(predictions_values)

# using the Bioclim function from dismo
set.seed(0308)
bioclim <- bioclim(predictions_values_2step, pres_data)

# predicting our model
prediction_bioclim <- predict(predictions_values_2step, bioclim)

# converting the prediction back to a Spatraster
prediction_bioclim <- terra::rast(prediction_bioclim)

plot(prediction_bioclim, main = "Bioclim prediction of Colias erate", xlab = "Longitude", ylab = "Latitude")


This is what BIOCLIM predicts the species range to be. For generating our absence points in areas of low probability we will use three different thresholds: 10%, 20% and 40% (they were chosen by comparing our observation points to the prediction and by looking at the amount of area then available for sampling the pseudo-absences).

# first defining a temporary filter with the threshold of 10%
tmpfilter <- prediction_bioclim < 0.1

#then masking the prediction_bioclim area using this temporary filter
absence_area_01 <- mask(tmpfilter,prediction_bioclim, maskvalue = 1)



# repeating the steps before for different values
tmpfilter <- prediction_bioclim < 0.2
absence_area_02 <- mask(tmpfilter,prediction_bioclim, maskvalue = 1)
tmpfilter <- prediction_bioclim < 0.4
absence_area_04 <- mask(tmpfilter,prediction_bioclim, maskvalue = 1)
detach()

# setting names
names(absence_area_01) <- "Absence Area, Threshold 10%"
names(absence_area_02) <- "Absence Area, Threshold 20%"
names(absence_area_04) <- "Absence Area, Threshold 40%"

# combining them into one Spatraster
Absence_area_overview <- absence_area_01
Absence_area_overview <- terra::`add<-`(Absence_area_overview,  absence_area_02)
Absence_area_overview <- terra::`add<-`(Absence_area_overview,  absence_area_04)


#plotting
plot(Absence_area_overview)



It is obvious that the area available for generating the absence points varies a lot between these thresholds. We will continue to use Absence Area, Threshold 10%, as the other areas are rather small:

presence_points <- terra::extract(predictions_values, butterfly[butterfly$`df_bf$ï..species` == "Colias_erate",2:3])
presence_points_minus_ID <- subset(presence_points, select = -ID)

# setting the seed and generate the absence points (still in whole Pakistan)
set.seed(0308)
absence_points <- spatSample(absence_area_01, sum(butterfly$`df_bf$ï..species` == "Colias_erate")*10, "random", na.rm=TRUE, as.points = TRUE)

# taking all the absence points in the predefined area of absence generated with our BIOCLIM
absence_points_in_area <- absence_points[absence_points$layer == 1]

#plotting it
plot(absence_area_01, main = "Absence points for BIOCLIM Colias erate 10% threshold")
plot(absence_points_in_area, add = TRUE)


You can see here that we managed to generate the absence points only outside the areas which are likely suitable for Colias erate.



# getting the coordinates of the absence points
coords_absence <- crds(absence_points_in_area, df=TRUE)

# extracting the values in the predictions values for the new absence points
absence_points <- terra::extract(predictions_values, coords_absence)

datatable(absence_points)



The only problem left is that the extract function added an ID column again. After removing it we can use the exact same code as before. This time 103 absence points were removed.
Redoing all the data preparation:


# removing the ID column from the absence points
absence_points_minus_ID <- subset(absence_points, select = -ID)


pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points_minus_ID))) 
pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID, absence_points_minus_ID)))
pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])
set.seed(0308)
sample <- sample.int(n = nrow(pres_abs_data), size = floor(0.6*nrow(pres_abs_data)), replace = FALSE) 
train <- pres_abs_data[sample, ]
test <- pres_abs_data[-sample, ]



Now that we have our absence points where we want them, we will look at the performance of the Classical and the Down Sampling Random Forest, just as in 4.3.1:

# now lets use the new data in our models again
set.seed(0906)
prediction_model_2step <- randomForest(pres_abs ~ ., data = train, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE)

set.seed(0906)
prNum <- as.numeric(table(train$pres_abs)["1"]) # number of presence records
spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes
prediction_model_downsampling_2step <- randomForest(pres_abs ~ ., data = pres_abs_data, replace = TRUE, importance = TRUE, mtry = 3, ntree = 1000, na.action = na.roughfix, do.trace = FALSE, sampsize = spsize)

prediction_model_sc
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = train, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.09%
## Confusion matrix:
##     0  1 class.error
## 0 372 20  0.05102041
## 1  24 20  0.54545455
table_test_classic <- table(test$pres_abs, predict(prediction_model_2step, newdata= test))
table_test_classic
##    
##       0   1
##   0 236   3
##   1  12  16
prediction_model_downsampling_2step
## 
## Call:
##  randomForest(formula = pres_abs ~ ., data = pres_abs_data, replace = TRUE,      importance = TRUE, mtry = 3, ntree = 1000, do.trace = FALSE,      sampsize = spsize, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 8.1%
## Confusion matrix:
##     0  1 class.error
## 0 550 47  0.07872697
## 1   7 63  0.10000000
table_test <- table(test$pres_abs, predict(prediction_model_downsampling_2step, newdata=test))
table_test
##    
##       0   1
##   0 222  17
##   1   2  26
# Now we will add the AUC and the error rates to our RandomForestmethods data frame. Additionally I will run this with the 20% and 40% thresholds so we can compare them. (The code is the same so I won't include it here)


The Down Sampling two-step method with the 10% threshold has the best AUC, closely followed by the 20% threshold and the Spatially Constrained Down Sampling. The 20% and 40% thresholds and the Spatially Constrained Down Sampling have similarly good results for Error 1 on the test data. For our final species richness map we will use the Spatially Constrained Down Sampling method, as it is only slightly worse at predicting but doesn't rely on the raster and sp packages. Let's have an overview of all the predicted areas for Colias erate with all the different methods.




Picture 13 Methods of generating Random Forests, x axis = Longitude, y axis = Latitude


Picture 14 Prediction of the presence area of Colias erate with the methods for alternative pseudo-absence point generation

Additionally, as you can see in Picture 14, you cannot see a difference between the Down Sampling Spatially Constrained method and the Down Sampling two-step methods. Now we will tune the number of trees (ntree) and the number of variables tried at each split (mtry).




4.4 Tuning ntree and mtry


First we will try to find the best number of trees. Generally, in the beginning with only a few trees, Random Forests tend to fluctuate quite a bit. The AUC shows an asymptotic behavior, meaning that at some point the performance can't be increased anymore even if you double the amount of trees. This means that at some point increasing the number of trees doesn't improve the performance of the Random Forest but only increases its computational cost. (Oshiro et al. 2012, p. 161 ff.)

4.4.1 Tuning ntree

We will now look at where we can draw the line for our Spatially Constrained Down Sampling model (as we will use it for our Species Richness Map), i.e. the point after which more trees don’t improve the performance anymore. The code for the diagram was inspired by https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=16s.

# creating a data frame with the OOB Error estimates for the different amount of trees using our Down Sampling Spatially Constrained Model
oob.error.data <- data.frame(Trees = rep(1:nrow(prediction_model_downsampling_sc$err.rate), times = 3),
                             Type = rep(c("OOB", "1", "0"), each = nrow(prediction_model_downsampling_sc$err.rate)),
                             Error = c(prediction_model_downsampling_sc$err.rate[, "OOB"],
                                       prediction_model_downsampling_sc$err.rate[, "1"],
                                       prediction_model_downsampling_sc$err.rate[, "0"]))

# plotting the performance by the Amount of Trees for the different errors 
ggplot(data = oob.error.data, aes(x = Trees, y = Error)) + geom_line(aes(color = Type))


In the beginning you can see the fluctuation mentioned above. At 500 trees there is a deviation, but after 750 trees the model doesn’t improve anymore. Just to be safe, since we also have other data sets, we will use 800 trees for our Species Richness Map.
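The cut-off can also be found programmatically instead of by eye. Below is a minimal base-R sketch using a simulated OOB error curve in place of the real err.rate column from the fitted model (the curve, the tolerance value, and the variable names are illustrative assumptions): after the last tree at which the error deviates more than the tolerance from the final error, the curve is considered stable.

```r
# simulated OOB error curve: noisy and high at first, then flattening out
# (stands in for prediction_model_downsampling_sc$err.rate[, "OOB"])
set.seed(1)
oob <- 0.12 + 0.3 * exp(-(1:1000) / 100) + rnorm(1000, sd = 0.002)

# smallest number of trees from which the error stays within `tol`
# of the final (1000-tree) error
tol         <- 0.005
final_error <- oob[length(oob)]
exceeds     <- abs(oob - final_error) > tol
plateau     <- if (any(exceeds)) max(which(exceeds)) + 1 else 1
plateau  # candidate ntree; everything beyond it mostly adds computation
```

On the real error curve you would of course replace the simulated `oob` vector with the model’s own err.rate column.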


4.4.2 Tuning mtry


Mtry tells the Random Forest how many variables it should try at each split. Normally you could use the tuneRF() function or the train() function from the caret package. While testing I managed to create Random Forest models with them, but both can’t handle down sampling yet, as they can’t use our sampsize vector with named classes (sample sizes). Also the predict() function, used to create a map of where the butterflies could be, doesn’t work with these functions. So I will show you how the tuneRF() and train() functions are built and then show you a solution I have built using a loop.
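For context, if you don’t tune mtry at all, randomForest falls back to a default that depends on the task: floor(sqrt(p)) predictors for classification and max(floor(p/3), 1) for regression, where p is the number of predictors. A minimal base-R sketch of that rule (the helper function name is my own, not part of any package):

```r
# default mtry used by randomForest when none is supplied
# (p = number of predictor variables)
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(7)                         # our 7 predictors, classification -> 2
default_mtry(7, classification = FALSE) # regression case -> 2
```

With 7 predictors the default would be mtry = 2, so the grid of 2 to 7 we search below brackets the default comfortably.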

TuneRF

##############################  tuneRF() #####################################################


# x = the predictor variables (our columns 2:8 in the training data)
# y = the response vector (our presence and absence)
# ntreeTry = the number of trees that should be used at each tuning step
# mtryStart = the mtry the program should start with
# stepFactor tells R by how much mtry should be inflated at each step
# improve = how much the improvement in the OOB error has to be in order for the search to continue
# and finally doBest = TRUE generates a Random Forest using the optimal mtry
prediction_train <- tuneRF(x = train[2:8], y = train$pres_abs, ntreeTry = 800, mtryStart = 3, stepFactor = 1.5, improve = 0, doBest = TRUE, plot = FALSE, trace = FALSE)
## -0.09090909 0 
## -0.1136364 0
# remember: for now you won't be able to use down sampling with this function, nor to use predict()



############################## caret::train() #################################################

# we need the package caret
library("caret")

# first we have to create a data frame which tells caret what mtry it should try in the tuning process
tuneGrid <- data.frame("mtry" = c(2,3,4,5,6,7))


# as in the tuneRF() function it uses the predictors and the presence/absence data
# then we want the method Random Forest (method = "rf")
# it should train based on the OOB error using the trainControl() function
# and it should use our tuneGrid data frame to tune our Random Forest
# caret is able to pass arguments through to the randomForest package, so the other arguments are from our normal Random Forest
set.seed(0906)
predicton_train_caret <- caret::train(train[2:8],
                         train$pres_abs,
                         method="rf",
                         trControl = caret::trainControl(method="oob", number = 10),
                         tuneGrid = tuneGrid, replace = TRUE,  
                         na.action = na.roughfix, 
                         do.trace = FALSE)

# The command gives us an object containing a Random Forest object, which you can access via predicton_train_caret$finalModel

# also this model won't be able to use down sampling, nor to use predict()



Because we want to find the best mtry for our Spatially Constrained Down Sampling model, I reran section 4.3.1 in the background so we get the training and testing data from the model there.

############################## Own Solution #################################################

# first we create a Data frame so we can store the results generated during the loop
oob.error.mtry <- data.frame("mtry" = 100, "OOB" = 100, "Error1" = 100)


# now it should test every number between 2 and 7 at mtry
for (i in 2:7) {

  # generating our normal Random Forest but with mtry = i
  prNum <- as.numeric(table(train$pres_abs)["1"]) # number of presence records
  spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes
  
  set.seed(0906) # setting a seed
  temp.model <- randomForest(pres_abs ~ ., data = pres_abs_data, replace = TRUE, importance = TRUE,
                             mtry = i, ntree = 1000, na.action = na.roughfix, do.trace = FALSE,
                             sampsize = spsize)

  # collecting the final error rates (last row of err.rate) with the corresponding mtry
  n <- nrow(temp.model$err.rate)
  temp <- cbind("mtry" = i, "OOB" = temp.model$err.rate[n, "OOB"], "Error1" = temp.model$err.rate[n, "1"])

  # adding the error rates to our data frame
  oob.error.mtry <- rbind(oob.error.mtry, temp)
}



datatable(oob.error.mtry[-1,])


You can see in this data table that mtry = 6 has the best OOB error, yet mtry = 3 has the best performance for Error 1. We will now save the mtry with the best performance for Error 1, because reducing this error is more important than a low OOB error driven by “well” classified pseudo-absence data.

# here we get the mtry with the lowest Error 1
mtry_model <- oob.error.mtry$mtry[oob.error.mtry$Error1 == min(oob.error.mtry$Error1)]
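One caveat: if two mtry values tie for the lowest Error 1, the logical subsetting above returns a vector with both of them, and randomForest would then fail on a non-scalar mtry. A tie-safe variant uses which.min(), which always returns the index of the first minimum. A small base-R sketch with made-up error values:

```r
# toy version of the oob.error.mtry results (values are made up for illustration)
oob.error.mtry <- data.frame(mtry   = 2:7,
                             OOB    = c(0.14, 0.13, 0.13, 0.12, 0.11, 0.12),
                             Error1 = c(0.12, 0.10, 0.11, 0.10, 0.13, 0.14))

# which.min() returns the index of the first minimum, so ties can't
# produce a vector of candidate mtry values
mtry_model <- oob.error.mtry$mtry[which.min(oob.error.mtry$Error1)]
mtry_model  # -> 3 (mtry = 3 and mtry = 5 tie; the first one wins)
```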



# using this mtry in our model
set.seed(0906)
prNum <- as.numeric(table(train$pres_abs)["1"]) # number of presence records
spsize <- c("0" = prNum, "1" = prNum) # sample size for both classes
prediction_model_downsampling_sc <- randomForest(pres_abs ~ ., data = pres_abs_data, replace = TRUE,
                                                 importance = TRUE, mtry = mtry_model, ntree = 1000,
                                                 na.action = na.roughfix, do.trace = FALSE,
                                                 sampsize = spsize)

Now that we have tuned everything we can, we will finally turn to creating our Species Richness Map.






5. Looping


For the loop we will use the Spatially Constrained Down Sampling method, as it performed very well and we want a model without raster and sp. First we have to do some data preparation again.

# first we add our first prediction, because the loop needs an object it can add the other species' predictions to; we also create a data frame so we can see how well the species models performed
species_richness_prediction <- predict(predictions_values, prediction_model_downsampling_sc)
OOB <- data.frame("Species" = "Colias_erate", "Estimate OOB Error" = 0.1277472527, "Error 0" = 0.1291793, "Error 1" = 0.116666666666667, "N° Observation Points" = 70) #data frame 


# we have to turn the species into a factor
butterfly$`df_bf$ï..species` <- as.factor(butterfly$`df_bf$ï..species`)

# so it won't calculate every single name over and over, we create a list which contains each butterfly name once
species <- levels(butterfly$`df_bf$ï..species`)
species_10 <- "Colias_erate"


As I already said, there are many butterfly species with very few observation points. We will only use species with at least 10 observation points in order to get meaningful predictions.

# a loop that checks how many observation points a species has and, if there are at least 10, adds the species to species_10
for (i in species) {
  sum_butterfly <- sum(butterfly$`df_bf$ï..species` == i)
  if(sum_butterfly > 9){ species_10 <- append(species_10, subset(species, species == i))}
}
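The same filter can also be done without a loop, using table() to count observations per species in one step. A hedged base-R sketch with a made-up vector of species names in place of butterfly$`df_bf$ï..species`:

```r
# toy species column: "A" appears 12 times, "B" 3 times, "C" 10 times
species_col <- rep(c("A", "B", "C"), times = c(12, 3, 10))

# count observations per species and keep the names with at least 10
counts     <- table(species_col)
species_10 <- names(counts[counts >= 10])
species_10  # -> "A" "C"
```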


# since we already added Colias erate to species_10 and to species_richness_prediction (so the program has something to add the results to), we now remove it from our species_10 list
species_10 <- species_10[-43]

species_10 <- species_10[-1]
# after this step we have 211 remaining species from our original 429. 


Now we can finally create the loop, which performs the steps from 3.5 and 4.3.1 for each species in species_10. At the end it predicts the species distribution, adds this prediction to species_richness_prediction, and adds the error rates to the OOB data frame. For each model there is also an inner loop that tries to find the best mtry, as shown in 4.4.2. We will use the whole data set and not split it into training and testing data, because even with 10 observation points the data would be too small to train a good model or to run a meaningful test.


for (i in species_10) {
  
  # data preparation
    # getting the data for the observation points
    presence_points <- terra::extract(predictions_values, butterfly[butterfly$`df_bf$ï..species` ==
                                                                      i ,2:3])
    # generating Pseudo absences
    set.seed(0308)
    absence_points <- spatSample(predictions_values, sum(butterfly$`df_bf$ï..species` == i )*10,
                                 "random", na.rm=TRUE, as.points = TRUE)
    sf_butterfly <- st_as_sf(butterfly, coords = c("X", "Y"), crs = crs)
    
    # removing all absence points within the buffers for the species as shown in 4.3.1
    sf_absence_points <- st_as_sf(absence_points, crs = crs)
    Index <- st_intersection(sf_absence_points,
                         butterfly_buffer$geometry[butterfly_buffer$`df_bf$ï..species` 
                                                   == i ])
    
    df_absence_points <- as.data.frame(sf_absence_points, xy = TRUE)
    df_Index <- as.data.frame(Index, xy = TRUE)
    distanced_absence_points <- anti_join(df_absence_points, df_Index, by = "geometry")

    presence_points_minus_ID <- subset(presence_points, select = -ID) 
    absence_points_distance <- subset(distanced_absence_points, select = -geometry)

    pres_abs <- c(rep(1, nrow(presence_points_minus_ID)), rep(0, nrow(absence_points_distance)))
    pres_abs_data <- data.frame(cbind(pres_abs, rbind(presence_points_minus_ID,
                                                      absence_points_distance)))
    pres_abs_data[ , "pres_abs"] <- as.factor(pres_abs_data[ , "pres_abs"])

    # preparing the down sampling part
    prNum <- as.numeric(table(pres_abs_data$pres_abs)["1"])
    spsize <- c("0" = prNum, "1" = prNum) 

    
# finding the best mtry

  # the results data frame has to be created before the loop, otherwise it is reset in every iteration
  oob.error.mtry <- data.frame("mtry" = 100, "OOB" = 100, "Error1" = 100)

  for (z in 2:7) {
    set.seed(0906)
    temp.model <- randomForest(pres_abs ~ ., data = pres_abs_data, replace = TRUE, importance = TRUE,
                               mtry = z, ntree = 800, na.action = na.roughfix, do.trace = FALSE,
                               sampsize = spsize)
    # the final error rates are in the last row of err.rate
    temp <- cbind("mtry" = z, "OOB" = temp.model$err.rate[800, "OOB"], "Error1" = temp.model$err.rate[800, "1"])
    oob.error.mtry <- rbind(oob.error.mtry, temp)
  }

  # choosing the mtry with the lowest Error 1
  mtry_model <- oob.error.mtry$mtry[oob.error.mtry$Error1 == min(oob.error.mtry$Error1)]


  
# generating the model 
  
  set.seed(0906)
  prediction_model_downsampling_sc <- randomForest(pres_abs ~ ., data = pres_abs_data, 
                                                  replace =TRUE,
                                                  importance = TRUE, mtry = mtry_model, ntree = 800,
                                                  na.action = na.roughfix, do.trace = FALSE, 
                                                  sampsize = spsize)


  
# collecting the data from the model
  
  # getting the Error rates
  
  new_row <- c(i , prediction_model_downsampling_sc$err.rate[800, 1:3],
               sum(butterfly$`df_bf$ï..species` == i))
  OOB <- rbind(OOB, new_row)
  
  # predicting the presence area of the species
    prediction_downsampling_sc <- predict(predictions_values, prediction_model_downsampling_sc) 
    
    # adding the prediction as a layer to our Species_richness_prediction
    species_richness_prediction <- terra::`add<-`(species_richness_prediction,
                                                  prediction_downsampling_sc)
    
}

# as seen before we have to change the numbers back 
species_richness_prediction[species_richness_prediction == 1] <- 0
species_richness_prediction[species_richness_prediction == 2] <- 1

# now we use the tapp() function again, only this time we calculate the sum so we get the number of species in a certain place
species_richness <- tapp(species_richness_prediction, index=c(1), fun=sum, na.rm=TRUE)
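What tapp() with fun = sum does here is conceptually just an element-wise sum of the 0/1 presence layers. A hedged base-R sketch of the same idea with plain matrices standing in for raster layers (the toy layers and their values are made up):

```r
# three toy 0/1 "presence layers" (2 x 2 cells instead of real rasters)
layer_a <- matrix(c(1, 0, 1, 0), nrow = 2)
layer_b <- matrix(c(1, 1, 0, 0), nrow = 2)
layer_c <- matrix(c(0, 1, 1, 0), nrow = 2)

# species richness = per-cell sum over all layers
richness <- Reduce(`+`, list(layer_a, layer_b, layer_c))
richness  # each cell now holds the number of "species" present there
```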

writeRaster(species_richness, file.path(wd, "Output/Species_richness.tif"))




Picture 13 Species Richness Map of butterflies in Pakistan


And there it is: our final Species Richness Map. We can now look at the OOB error rates of all the species.



Of the 212 species, 52 have an Error 1 higher than 20%, while 20 have an estimated OOB error of more than 20%.
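Counts like these can be obtained directly from the OOB data frame. A hedged base-R sketch with made-up error rates for five species (the column names follow the make.names() versions R creates when the data frame is built):

```r
# toy OOB table with made-up error rates for five species
OOB <- data.frame(Species            = paste0("sp", 1:5),
                  Estimate.OOB.Error = c(0.10, 0.25, 0.15, 0.30, 0.05),
                  Error.1            = c(0.22, 0.18, 0.35, 0.10, 0.21))

# how many species exceed a 20% threshold on each error
sum(OOB$Error.1 > 0.2)             # -> 3
sum(OOB$Estimate.OOB.Error > 0.2)  # -> 2
```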

boxplot(OOB$Estimate.OOB.Error, main = "Estimate OOB Error")

boxplot(OOB$Error.0, main = "Error 0")

boxplot(OOB$Error.1, main = "Error 1")

plot(OOB$N..Observation.Points, OOB$Estimate.OOB.Error, main = "Estimate OOB Error in comparison to N° of Observation Points")


There is a slight connection between the number of observation points and the estimated OOB error.



Sources



Barbet-Massin, M.; Albert, C.; Jiguet, F. & W. Thuiller (2012): Selecting pseudo-absences for species distribution models: how, where and how many?. Methods in Ecology and Evolution 2012, 3, p 327–338. Online available under https://www.researchgate.net/publication/229149956_Selecting_Pseudo-Absences_for_Species_Distribution_Models_How_Where_and_How_Many. Last checked 31.05.2022.

Cutler DR, Edwards Jr TC, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology, 88(11): 2783-2792. Online available under https://support.bccvl.org.au/support/solutions/articles/6000083217-random-forest. Last checked 13.06.2022.

Genuer R. & J.-M. Poggi (2020): Random Forests in R. Cham. Online available under https://link.springer.com/content/pdf/10.1007%2F978-3-030-56485-8.pdf. Last checked 13.06.2022.

Gupta S. (2021): Regression vs. Classification in Machine Learning: What’s the Difference?. Springboard. Online available under https://www.springboard.com/blog/data-science/regression-vs-classification/. Last checked 31.05.2022.

Hijmans R. & J. Elith (o.d.): Species distribution modelling. Spatial Data Science with R. online available under https://rspatial.org/raster/sdm/index.html#species-distribution-modeling. Last checked 31.05.2022.

Kho, J. (2018): Why Random Forest is My Favorite Machine Learning Model. Toward Data Science. Online available under https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706. Last checked 13.06.2022.

o. A. (2021): Random Forest. Biodiversity and Climate Change Virtual Laboratory (bccvl). Online available under https://support.bccvl.org.au/support/solutions/articles/6000083217-random-forest. Last checked 13.06.2022.

Oshiro, T.; Perez, P. & J. Baranauskas (2012): How Many Trees in a Random Forest?. Lecture Notes in Computer Science. Volume: 7376. p. 154-168. Online available under https://www.researchgate.net/publication/230766603_How_Many_Trees_in_a_Random_Forest. Last checked 04.06.2022.

Saadat, H.; Chaudhry, M.; Manzoor, F. & G. Nasim (2016): Effect of climate change on butterfly population of selected coniferous forests of Murree Hills and adjacent areas, Pakistan. Pakistan J. Zool., vol. 48(6), pp. 1963-1969. Online available under https://www.researchgate.net/publication/310619804_Effect_of_climate_change_on_butterfly_population_of_selected_coniferous_forests_of_Murree_Hills_and_adjacent_areas_Pakistan. Last checked 13.06.2022.

Senay, S. D., Worner, S. P., & Ikeda, T. (2013). Novel three-step pseudo-absence selection technique for improved species distribution modelling. PloS one, 8(8), e71218. Online available under https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3742778/pdf/pone.0071218.pdf. Last checked 02.06.2022.

Valavi, R.; Elith, J.; Lahoz-Monfort, J. & G. Guillera-Arroita (2021): Modelling species presence-only data with random forests. In Ecography. Volume 44. Issue 12. P. 1731-1742. Online available under https://onlinelibrary.wiley.com/doi/full/10.1111/ecog.05615. Last checked 13.06.2022.

Yiu, T. (2019): Understanding Random Forest. Towards Data Science. Online available under https://towardsdatascience.com/understanding-random-forest-58381e0602d2. Last checked 13.06.2022.




sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] caret_6.0-92        lattice_0.20-45     dismo_1.3-5        
##  [4] raster_3.5-15       sp_1.4-7            DT_0.23            
##  [7] ROCR_1.0-11         dplyr_1.0.9         randomForest_4.7-1 
## [10] ggplot2_3.3.6       sf_1.0-7            terra_1.5-21       
## [13] rnaturalearth_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-153            lubridate_1.8.0         tools_4.1.2            
##  [4] bslib_0.3.1             utf8_1.2.2              rgdal_1.5-32           
##  [7] R6_2.5.1                rpart_4.1-15            KernSmooth_2.23-20     
## [10] DBI_1.1.2               colorspace_2.0-3        nnet_7.3-16            
## [13] withr_2.5.0             rnaturalearthdata_0.1.0 tidyselect_1.1.2       
## [16] compiler_4.1.2          cli_3.3.0               labeling_0.4.2         
## [19] sass_0.4.1              scales_1.2.0            classInt_0.4-3         
## [22] proxy_0.4-26            stringr_1.4.0           digest_0.6.29          
## [25] rmarkdown_2.14          pkgconfig_2.0.3         htmltools_0.5.2        
## [28] parallelly_1.31.1       fastmap_1.1.0           highr_0.9              
## [31] htmlwidgets_1.5.4       rlang_1.0.2             rstudioapi_0.13        
## [34] jquerylib_0.1.4         farver_2.1.0            generics_0.1.2         
## [37] jsonlite_1.8.0          crosstalk_1.2.0         ModelMetrics_1.2.2.2   
## [40] magrittr_2.0.3          s2_1.0.7                Matrix_1.3-4           
## [43] Rcpp_1.0.8.3            munsell_0.5.0           fansi_1.0.3            
## [46] lifecycle_1.0.1         pROC_1.18.0             stringi_1.7.6          
## [49] yaml_2.3.5              MASS_7.3-54             plyr_1.8.7             
## [52] recipes_0.2.0           grid_4.1.2              parallel_4.1.2         
## [55] listenv_0.8.0           crayon_1.5.1            splines_4.1.2          
## [58] knitr_1.39              pillar_1.7.0            stats4_4.1.2           
## [61] reshape2_1.4.4          future.apply_1.9.0      codetools_0.2-18       
## [64] wk_0.6.0                glue_1.6.2              evaluate_0.15          
## [67] data.table_1.14.2       vctrs_0.4.1             foreach_1.5.2          
## [70] gtable_0.3.0            purrr_0.3.4             future_1.25.0          
## [73] assertthat_0.2.1        xfun_0.31               gower_1.0.0            
## [76] prodlim_2019.11.13      e1071_1.7-9             class_7.3-19           
## [79] survival_3.2-13         timeDate_3043.102       tibble_3.1.7           
## [82] iterators_1.0.14        hardhat_0.2.0           units_0.8-0            
## [85] lava_1.6.10             globals_0.15.0          ellipsis_0.3.2         
## [88] ipred_0.9-12