1 Theory of a Random Forest


The Random Forest (or Random Decision Forest) is a model in the area of machine learning. It is an extension of Classification and Regression Trees (CART). A Random Forest grows many classification trees and puts the input data down each of the trees in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes over all the trees in the forest.




1.1 What is a Classification Tree?


Warning!

It is assumed that the reader is familiar with the concept of single classification trees. If not, this should be studied independently beforehand. Further information can be found on Wikipedia.


Quick Reminder:

Let’s go quickly over decision trees as they are the building blocks of the random forest model. Fortunately, they are pretty intuitive and most people have used them in their lives, knowingly or not.

Decision trees build classification or regression models in the form of a tree structure. The model breaks down a dataset into smaller and smaller subsets while at the same time incrementally developing an associated decision tree. The final result is a tree with decision (internal) nodes and leaf nodes (see image below). Furthermore, the child node reached by splitting to the left is called the “left daughter” and the one reached by splitting to the right the “right daughter”.

Let’s consider the following example in which we use a decision tree to decide upon an activity on a particular day:


Img. 1: Example for a simple everyday decision tree by nexttech.

Applied to a real-life scenario, our data will not be this clean, but the logic that a decision tree employs remains the same.




1.1.1 The Difference between a Classification Tree and a Random Forest


A Random Forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the Random Forest generates a class prediction (vote), and the class with the most votes over all the trees in the forest becomes our model’s prediction (see image below).



Img. 2: Comparison between Decision Trees and Random Forest.

The image above shows an outline of a comparison between a single Decision Tree and a Random Forest model. To predict a classification, the model is given the six attributes A, B, C, D, E and F. For each tree, data is drawn at random from the training set and each chosen sample is put back into the statistical population for reuse (sampling with replacement); the resulting subset is called a bootstrap sample, as pointed out in decision tree 1. At the first node only three randomly selected attributes (A, D and F) are considered for the split, and at every following node another three randomly picked attributes (e.g. A, B and E) are drawn, until a terminal node is reached. This process takes place for every decision tree. Each tree delivers a class prediction at its terminal node, and the class with the most votes over all trees wins the classification; for a regression, the mean of all tree predictions is calculated instead.

The number of trees is set via the parameter “ntree” and the number of randomly selected attributes per split via “mtry” (blue squares). In this example mtry is three and ntree is n.

The data subsets that are drawn with replacement for each tree are known as bootstrap samples. This is important for understanding how the model validates itself through the OOB error rate, which makes separate cross-validation unnecessary.
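
As a small illustration (a toy sketch with made-up row numbers, not part of the tutorial data), this is what drawing a bootstrap sample and its out-of-bag remainder looks like in R:

set.seed(1)
rows <- 1:10
boot <- sample(rows, replace=TRUE) # rows drawn with replacement, some appear more than once
oob  <- setdiff(rows, boot)        # rows that were never drawn are "out of bag" for this tree
boot
oob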

The function “randomForest” can take a formula or, in two separate arguments, a data frame with the predictor variables (previously called “attributes”) and a vector with the response. If the response variable is a factor (categorical), randomForest performs a classification, otherwise a regression. In species distribution modeling we are usually interested in classification (the species is present or not).
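
As a small sketch (the toy data frame and the names toy_df, a, b and y are made up for illustration), the two calling styles look like this:

library(randomForest)
toy_df <- data.frame(a=rnorm(100), b=rnorm(100),
                     y=factor(sample(c('absent', 'present'), 100, replace=TRUE)))

rf1 <- randomForest(y ~ ., data=toy_df)                   # formula interface
rf2 <- randomForest(x=toy_df[, c('a', 'b')], y=toy_df$y)  # predictors and response as separate arguments

rf1$type # "classification", because y is a factor; a numeric y would give "regression"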




1.1.2 Wisdom of Crowds


The fundamental concept behind Random Forest is a simple but powerful one: the wisdom of crowds. The reason why the random forest model works so well is:

  • A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

  • The low correlation between models is the key.

Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this effect is that the trees protect each other from their individual errors (as long as they don’t all constantly err in the same direction).

That means while some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.

The model uses bagging (drawing bootstrap samples of the training data with replacement) and feature randomness (randomly selecting a subset of features at each split) when building each individual tree, in order to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
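
The effect can be illustrated with a toy simulation (a sketch assuming perfectly independent voters, which real trees only approximate): each voter is correct with probability 0.65, and the majority vote of 101 such voters is correct far more often than any single one.

set.seed(42)
n_trials <- 10000
single    <- rbinom(n_trials, size=1,   prob=0.65)        # 1 = a single voter is correct
committee <- rbinom(n_trials, size=101, prob=0.65) > 50   # is the majority of 101 voters correct?
mean(single)    # about 0.65
mean(committee) # close to 1 for independent voters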

So, what do we need in order for our random forest to make accurate class predictions?

  1. We need features that have at least some predictive power (predictors).
  2. The trees of the forest and more importantly their predictions need to be uncorrelated or at least have low correlations with each other.

As said, Random Forest can produce both classification and regression models. In the upcoming exercise only the classification method is used — we want to know which class an observation belongs to (presence or absence). The ability to precisely classify observations is extremely valuable for scientists, for example for predicting whether a particular organism such as an animal or a plant may be present or absent in a specific area. With this information, initiatives such as species and environmental protection can be supported.




2 RUN FOREST, RUN!


Forrest Gump Gif from Giphy


Now it is time to build our forest. This needs to be done in five steps:
1. Loading the required packages
2. Loading Environmental Data
3. Data Preparation
4. Generating the Forest
5. Prediction

For further information regarding this method in general see Random Forests by Leo Breiman and Adele Cutler and especially CRAN R Package Information for randomForest.

Title: Breiman and Cutler’s Random Forests for Classification and Regression
Package: ‘randomForest’
Version: 4.6-14




2.1 Loading required packages


We need to install/load three packages: maptools, dismo and randomForest. In total, however, five packages are loaded: the remaining two, “sp” and “raster”, are loaded automatically as dependencies of maptools and dismo.

library(maptools) #includes the library of sp
## Warning: package 'maptools' was built under R version 3.5.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.5.3
## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()
library(dismo) #includes the library of raster
## Warning: package 'dismo' was built under R version 3.5.3
## Loading required package: raster
## Warning: package 'raster' was built under R version 3.5.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.




2.2 Loading environmental data


First we load the predictor files into R. They serve as the predictor variables for our presence/absence response. More information regarding the predictor files can be found in the section Info Dataset.

setwd('C:/Users/Mandy/Documents/Uni/PROJECT_PAKISTAN/R/')
p_files <- list.files(path='C:/Users/Mandy/Documents/Uni/PROJECT_PAKISTAN/R/wc2.0_30s_predictor_files/', full.names=TRUE )


We then stack the single predictor files to a raster stack.

predictors <- stack(p_files) #Raster-stacking 9 predictor files to one file
#predictors # show Raster-Stack information is optional
#names(predictors) # show names of predictors is optional


This is what the raster stack now looks like.

plot(predictors)


Now we load the borders of Pakistan to make the plot visually more attractive, and we load the presence points of our butterfly data. These data are taken from Tshikolovets & Pagès (2016): The Butterflies of Pakistan.

#library(maptools)
data(wrld_simpl) # load outline of world

# loading Data of butterflies
file <- 'C:/Users/Mandy/Documents/Uni/PROJECT_PAKISTAN/R/distribution_merged_Pakistan/distribution_butterflies_pakistan.csv'
butterflies <- read.table(file, header=TRUE, sep=',')


Afterwards we also plot the spatial data points onto the map.

# Plotting the first layer of the RasterStack (prec)
plot(predictors, 1) # plot prec
plot(wrld_simpl, add=TRUE) # plot the world map on top of prec; note the "add=TRUE" argument
points(butterflies, pch=19, cex=0.1, col='black') # add the presence points on top of predictor and world map




2.3 Data Preparation


We now have a set of predictor variables (rasters) and presence points.
The next step involves extracting every value of every predictor file at the locations of the butterfly presence points.

Why do we do this? To know which climatic conditions the butterflies like and don’t like. With this information we generate a training and a testing set for the Random Forest model.

In particular this means we

  1. delete all unnecessary columns from the original database of presence points (species, subspecies and id). We only need the coordinates.

  2. extract all values of presence points from the predictor stack.

  3. generate 400 random absence points from background data (what background data is can be read on rspatial.org)

  4. combine presence and absence points into a single data frame in which the first column (variable ‘pb’) indicates whether this is a presence (1) or a background/absence (0) point.

  5. declare pb as a factor; pb is a categorical variable (called a ‘factor’ in R) and this explicit definition is important so that it is not treated like any other numerical variable. Our goal is still a classification of whether a spot is suitable for a butterfly or not. If we did not declare it as a factor, Random Forest would do a regression instead of a classification.

  6. split the original dataset into a training and a testing set. With these sets we train and test the model.

#1 deleting unnecessary columns
butterflies <- butterflies[,-(3:5)] #-(3:5) = delete columns 3 to 5

#2 presence-points
presence <- extract(predictors, butterflies) # Extracting presence values from rasters

#3 setting random seed to always create the same random set of points for this example
set.seed(0)
background <- randomPoints(predictors, 400) # creating 400 random absence-points
absence <- extract(predictors, background) # Extracting absence values from rasters

#4 combining presence and absence with value 1 (present) and value 0 (absent)
pb <- c(rep(1, nrow(presence)), rep(0, nrow(absence))) 
sdmdata <- data.frame(cbind(pb, rbind(presence, absence))) # putting pb and climate data into a single dataframe 

#5 pb as factor
sdmdata[,'pb'] = as.factor(sdmdata[,'pb']) # set column 'pb' as factor, otherwise it'll do Regression

#6 splitting data in training and testing sets
set.seed(123) # set seed so that same sample can be reproduced in future 
sample <- sample.int(n = nrow(sdmdata), size = floor(.60*nrow(sdmdata)), replace = F) # selecting 60% of data as sample
train <- sdmdata[sample, ] # filling train set with 60% of original data
test  <- sdmdata[-sample, ] # filling test set with 40% of original data
summary(sdmdata)
##  pb            prec             srad          temp_avg          temp_max      
##  0: 400   Min.   :  1.00   Min.   : 5652   Min.   :-35.300   Min.   :-35.000  
##  1:5870   1st Qu.: 20.00   1st Qu.: 6894   1st Qu.:-11.900   1st Qu.: -6.700  
##           Median : 38.00   Median : 8044   Median :  3.000   Median :  9.500  
##           Mean   : 39.83   Mean   : 8733   Mean   : -1.043   Mean   :  4.956  
##           3rd Qu.: 57.00   3rd Qu.: 9880   3rd Qu.:  8.400   3rd Qu.: 15.800  
##           Max.   :108.00   Max.   :16268   Max.   : 19.200   Max.   : 26.200  
##           NA's   :134      NA's   :134     NA's   :134       NA's   :134      
##     temp_min            vapr             wind      
##  Min.   :-36.300   Min.   :0.0000   Min.   :0.400  
##  1st Qu.:-16.900   1st Qu.:0.1200   1st Qu.:1.100  
##  Median : -3.600   Median :0.3900   Median :1.300  
##  Mean   : -7.042   Mean   :0.4149   Mean   :1.464  
##  3rd Qu.:  1.400   3rd Qu.:0.6600   3rd Qu.:1.700  
##  Max.   : 15.600   Max.   :1.2600   Max.   :6.400  
##  NA's   :134       NA's   :134      NA's   :134




2.4 Generating Random Forest


First we take a look at the structure of this method and check out some important arguments:

randomForest(argument1, argument2, argumentx) is the function that implements Breiman’s random forest algorithm, based on Breiman and Cutler’s original Fortran code.

A list of all possible arguments can be found in this documentation from the CRAN R Project.

  • formula (e.g. pb ~ .): the response variable is required and should contain the factor presence or absence (1 or 0); the tilde followed by a full stop means that all remaining variables in the data are used as predictors; otherwise specific attributes must be named individually. Example: pb ~ .

  • data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from; here it is the dataset used for the bootstrap samples. Example: data=datasetTrain

  • importance: should the importance of the predictors be assessed? Activates the variable importance computation. Example: importance=TRUE / importance=FALSE

  • mtry: number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3). Example: mtry=3 / mtry=5

  • ntree: number of trees to grow. This number should not be too small, to ensure that every input row gets predicted at least a few times. The more trees are chosen, the longer the computation takes. Example: ntree=100 / ntree=1000

  • replace: should sampling of cases be done with or without replacement? Example: replace=TRUE / replace=FALSE

  • do.trace: if set to TRUE, a more verbose output is given while randomForest is running. If set to some integer, running output is printed for every do.trace-th tree. Example: do.trace=TRUE / do.trace=FALSE

  • keep.forest: if set to FALSE, the forest is not retained in the output object. Keep it if we want to take a deeper look into the structure of a single tree later on. Example: keep.forest=TRUE / keep.forest=FALSE

  • na.action: the randomForest package has functions that impute missing values, e.g. na.roughfix imputes by median/mode. Example: na.action=na.roughfix




Finally we implement the Forest:

#library(randomForest)
set.seed(123) 
myForest<- randomForest(pb ~., data=train, importance=TRUE, mtry=3, ntree=200, replace=TRUE, do.trace=TRUE, keep.forest=TRUE, na.action=na.roughfix)
myForest
## 
## Call:
##  randomForest(formula = pb ~ ., data = train, importance = TRUE,      mtry = 3, ntree = 200, replace = TRUE, do.trace = TRUE, keep.forest = TRUE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.48%
## Confusion matrix:
##     0    1 class.error
## 0 124   97 0.438914027
## 1  34 3507 0.009601807

set.seed(123): A seed is used to initialize a pseudorandom number generator. Setting the same seed allows two or more systems (or repeated runs) to generate the same sequence of pseudorandom numbers and thus reproduce and match the results.

myForest is the name of the variable in which the fitted forest is stored.

If we want to take a deeper look into one single tree, we can use the function getTree. This function extracts the structure of a tree from a randomForest object, represented as a matrix. myForest is the randomForest object, k selects the specific tree to extract, and labelVar=TRUE means that the variable names and predicted class labels are shown instead of numeric indices.

getTree(myForest, k=47, labelVar=TRUE)

Img. 3: Screenshot of the function getTree().

What does this mean?

Each row of the matrix describes one node of tree k; the root node is in row 1. The split var column names the predictor tested at that node, and split point gives the threshold of the test (is split var smaller than split point?). The status column indicates whether a node is an internal node (1) or a terminal node (-1).

  • Status 1 marks an internal node: the left daughter and right daughter columns give the row numbers of the child nodes an observation is sent to, depending on whether the split condition is fulfilled or not, and so the tree expands. The prediction column stays empty for internal nodes.

  • Status -1 marks a terminal node, which returns a prediction of 1 or 0, the levels of the factor pb. This means a butterfly is predicted as present or absent.
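
To inspect just the root node or the terminal nodes of the extracted tree, one could do something like the following (a small sketch; the exact values depend on the fitted forest):

tree47 <- getTree(myForest, k=47, labelVar=TRUE)
tree47[1, ]                          # root node: split var, split point, daughters, status = 1
head(tree47[tree47$status == -1, ]) # terminal nodes with their predictions (status = -1)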




2.4.1 OOB-Error Rate


The out-of-bag (OOB) error is a way of validating the Random Forest model. The original training data is randomly sampled with replacement, generating smaller subsets of data known as bootstrap samples. Each decision tree is trained separately on one of these bootstrap samples, and the final result is determined by a majority vote over all decision trees.

The observations that are “left out” of a tree’s bootstrap sample are known as its out-of-bag (OOB) sample. They are not used as training data for that tree. After the decision tree has been trained, the OOB observations are given to it as unseen data and the tree predicts their outcome. The OOB prediction is then compared with the real outcome in the dataset: if they are identical, a correct prediction was made, otherwise a false one.

Long story short: the OOB error rate is the proportion of out-of-bag observations that are predicted incorrectly.

Therefore the OOB error rate is a good estimate for the generalization error.
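
Using the numbers from the confusion matrix printed above, the 3.48% can be re-derived by hand (a small sketch based on the confusion element of the fitted forest):

cm <- myForest$confusion[, 1:2]       # drop the class.error column
sum(cm[row(cm) != col(cm)]) / sum(cm) # misclassified OOB cases / all cases: (97 + 34) / 3762 = 0.0348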

Below we inspect oob.times, which records for every observation how many times it was NOT part of a bootstrap sample (i.e. how often it was out of bag). We can also calculate the mean number of OOB occurrences over all observations.

myForest$oob.times
mean(myForest$oob.times)

Img. 4: Screenshot of a Matrix showing the times a sample was not used (OOB).




err.rate displays the OOB error rate as a matrix with one row per number of trees. In the last row we find the final OOB error rate, i.e. the percentage of wrongly predicted presence/absence points (overall and per class).

myForest$err.rate

Img. 5: Screenshot of the OOB Error Rate.




This plot shows the evolution of the OOB error rate with an increasing number of trees. Why is it important? Because you should ask yourself: “How many trees should I grow in the forest?” The error fluctuates strongly for a small number of trees and levels out as more trees are added. The x-axis shows the number of trees, the y-axis the OOB error rate.

plot(myForest$err.rate[,1], type="l")
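
If you want a rough numeric answer instead of reading it off the plot, the per-tree OOB error is stored in the "OOB" column of err.rate (a small sketch):

which.min(myForest$err.rate[, "OOB"]) # number of trees at which the OOB error is lowest
min(myForest$err.rate[, "OOB"])       # the corresponding OOB error rate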




2.5 Prediction


With the information of presence points the first prediction verifies our model.

  1. Feed the forest with the training data and calculate the resubstitution (training) error.
class_train <- train$pb


The command ‘predict’ initiates the prediction. The previously generated forest (myForest) is applied to the dataset ‘train’ again to verify the generated results. There should be a high matching rate, i.e. the smallest possible resubstitution error on this sample.


prediction_train <- predict(myForest, newdata=train) 

table_train <- table(class_train, prediction_train) # saving matrix in a table
table_train # perfect results?
##            prediction_train
## class_train    0    1
##           0  218    3
##           1    0 3464


  2. Feed the forest with the test data and calculate the generalization error.
class_test <- test$pb

prediction_test <- predict(myForest, newdata=test)

table_test <- table(class_test, prediction_test)
table_test # perfect results?
##           prediction_test
## class_test    0    1
##          0  100   79
##          1   26 2246
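
From the two confusion tables above, the resubstitution and generalization errors can be computed directly (a small sketch; the exact numbers follow from the tables shown above):

err_train <- 1 - sum(diag(table_train)) / sum(table_train) # resubstitution (training) error, close to 0
err_test  <- 1 - sum(diag(table_test))  / sum(table_test)  # generalization (test) error, roughly 4%
err_train
err_test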




2.5.1 Plot the Prediction


And finally we visualize the prediction.

r = raster(predictors, 1)
par(mfrow=c(1,2))
plot(!is.na(r), col=c('white', '#F6E3CE'), legend=FALSE, main='Presence of Butterflies')
plot(wrld_simpl, add=TRUE, border='dark grey')
points(butterflies, pch=19, cex=0.1, col='black')
prediction <- predict(predictors, myForest)

plot(prediction, col=c('#F6E3CE', 'black'), legend=FALSE, main='Possible Habitats for Butterflies')
plot(wrld_simpl, add=TRUE, border='dark grey')




2.5.2 Variable Importance Plot


The variable importance plot shows the importance of every variable according to two measures: Mean Decrease Accuracy and Mean Decrease Gini. For more information regarding these two measures see Louppe et al. (2013). The importance increases from bottom to top.

imp <- varImpPlot(myForest)

imp
##          MeanDecreaseAccuracy MeanDecreaseGini
## prec                 24.51009         39.91361
## srad                 35.16589        128.51165
## temp_avg             20.72377         51.29401
## temp_max             19.11231         56.55813
## temp_min             20.70866         54.48442
## vapr                 20.29234         53.75408
## wind                 14.33831         26.55238

We can also display how often each variable was used across all trees:

varUsed(myForest, by.tree=FALSE, count=TRUE)
## [1] 4164 5447 3641 4119 4229 3594 2752
head(sdmdata) #show variables names again
##   pb prec srad temp_avg temp_max temp_min vapr wind
## 1  1   40 6820    -10.9     -6.2    -15.7 0.10  1.6
## 2  1   49 8292      3.5      9.9     -3.0 0.39  1.4
## 3  1   66 8233      6.3     13.2     -0.5 0.52  1.1
## 4  1   66 9008      7.3     13.4      1.1 0.62  1.1
## 5  1   75 8484      5.8     11.9     -0.2 0.50  1.2
## 6  1   75 8162      4.3     10.2     -1.6 0.47  1.1
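
The counts follow the order of the predictors in the fitted forest, so they can be labelled (a small sketch, assuming the row names of the importance matrix reflect that order):

setNames(varUsed(myForest, by.tree=FALSE, count=TRUE), rownames(myForest$importance))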




3 Info Dataset


Why did we choose those variables?


Img. 6: Butterfly at Malakand, Pakistan.

Solar Radiation (srad): The bodies of butterflies have to be warm (about 25 to 26 °C) to fly. Because they cannot regulate their body temperature themselves, they depend on their environment, more specifically on sunlight. To get maximum exposure to warmth they bask in the sun with spread wings. But in hotter climates they can also easily overheat, so they are usually only active in the early morning, late afternoon or early evening (i.e. the cooler parts of the day) and rest in the shade during the heat of the day. Butterflies living in cooler climates, as well as moths, may warm their bodies by vibrating their wings or by means of heat-insulating hair.

Precipitation/Water Vapor (prec/vapr): As the Scientific American stated, butterflies are rarely seen during heavy rain and wind. Not only is rain a direct threat of injury or death, but the cool air associated with storms may also reduce temperatures below the thermal threshold for flying. Overcast skies limit their ability to gather the solar radiation needed to fly. When skies darken, butterflies seek shelter in their nighttime homes. Furthermore, a butterfly knocked out of the air by raindrops faces the double threat of crashing in an inhospitable habitat where predators lie in wait, and of being unable to warm its body sufficiently to regain flight.

Wind speed: As Professor Robert Dudley, head of the Animal Flight Laboratory at U.C. Berkeley, states, no matter the animal, if the wind speed exceeds its maximum flight speed, then it can’t fly. Different species of butterflies fly at different speeds, in a range of about 1.5 to 10 meters per second. If the wind speed is any higher than that, the butterfly isn’t slipping the surly bounds of Earth to cut through it. He also says that many small butterflies avoid high winds as they would merely drift. A butterfly’s wing is susceptible to ambient winds, so, while capable of flying into the wind, butterflies aren’t exactly engineered for it. Nevertheless, windy days can also be useful: with the help of tailwinds, migrating butterflies can travel at 100 km/h at an altitude of several hundred metres.

Temperature (min/max/avg): As stated, butterflies need warm temperatures to maintain their body heat. Temperature also affects metamorphosis: larvae and pupae exposed to warmer temperatures develop into butterflies about twice as fast as those exposed to cooler temperatures. Therefore temperature is an important variable.


We took the predictor files from WorldClim. It provides average monthly climate data for minimum, mean and maximum temperature and for precipitation for the years 1970 to 2000. Variables can be downloaded at different spatial resolutions, from 30 seconds (~1 km²) to 10 minutes (~340 km²). Each download is a “zip” file containing 12 GeoTiff (.tif) files, one for each month of the year (January is 1; December is 12).

Below, the units of each predictor are shown:

Name of predictor file Description
temp_min minimum temperature in °C
temp_max maximum temperature in °C
temp_avg average temperature in °C
prec precipitation in mm
srad solar radiation in kJ m-2 day-1
vapr water vapor pressure in kPa
wind wind speed in m s-1




6 Session info


sessionInfo() prints information about the R version, the operating system and the attached or loaded packages that were used. It is included for information purposes only and improves reproducibility.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] randomForest_4.6-14 dismo_1.1-4         raster_3.0-7       
## [4] maptools_0.9-9      sp_1.3-2           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       codetools_0.2-15 lattice_0.20-35  digest_0.6.23   
##  [5] grid_3.5.1       magrittr_1.5     evaluate_0.14    rlang_0.4.2     
##  [9] stringi_1.4.3    rmarkdown_2.0    rgdal_1.4-8      tools_3.5.1     
## [13] foreign_0.8-70   stringr_1.4.0    xfun_0.11        yaml_2.2.0      
## [17] compiler_3.5.1   htmltools_0.4.0  knitr_1.26