Species richness of butterflies in Pakistan

Species Distribution Modeling with BIOCLIM

1. Introduction

The following tutorial is part of the project seminar “Species Distribution Modeling” at Philipps-University Marburg. The aim of this tutorial is to present the basic features of the algorithm BIOCLIM by modeling a prediction of the specie´s distribution of butterflies and creating a species richness map for the country Pakistan.

2. The Theory behind BIOCLIM

BIOCLIM is a classic envelope-style method and belongs to the profile methods, who constitute the oldest family of SDM algorithms (Elith et al. 2006). It was invented as an collaborative project of CSIRO Water and Land Resources and the Federal Government’s Bureau of Flora and Fauna by mainly Henry Nix, John Busby and Michael Hutchinson (Booth 2018; Busby 1991; Nix 1986).

It was the first species distribution modelling package that linked species occurrence data with maps of environmental variables spatially. So the current distribution of a species is the indicator of the climatic requirements (Booth et al. 2014). The climate variables will be correlated with another in the observed distributions. Such correlative models identify the fundamental niche through the modeling of the limiting mechanisms of the climatic requirements of the species (Babar et al. 2012).

An envelope-style method defines a multi-dimensional environmental space as the potential range where a species can occur (Elith et al. 2006). It is bounded by the maximum and minimum values of the environmental factors (Beaumont et al. 2005). The logic is very similar to Hutchinson´s niche concept (Booth et al. 2014).
The following figure should give you a better understanding for the mechanism behind the algorithm:

Figure 1: The multi-dimensional rectilinear envelope. Source: Biodiversity and Climate Change Virtual Laboratory 2016.

Figure 1: The multi-dimensional rectilinear envelope. Source: Biodiversity and Climate Change Virtual Laboratory 2016.

Environmental data values at the locations of species occurrence are treated as multiple one-tailed percentile distributions; it creates a percentile distribution for each variable (the 5th percentile is treated the same as the 95th percentile) (Hijmans & Graham 2006).

3. Data Information

The basic idea behind species distribution models is to take two sources of information to model the conditions in which a species is expected to occur. BIOCLIM uses presence records only and a set of environmental predictors (e.g. precipitation, temperature) across a pre-defined landscape that is divided into grid cells.

3.1 Occurence Data

Usually occurence data consists of latitude and longitude geographic coordinates where the species of interest has been observed. These are known as ‘presence’ data. Some models also make use of ‘absence’ data, which are geographic coordinates of locations where the species is known not to occur.

For this tutorial, we will use the occurrence data of the butterfly species that you can download here; save it in the data folder that you will create in the next work step. The csv.file contains 429 butterfly species (3 columns: the species names and coordinates x, y). Species with too little or no records were discarded automatically by the model and were not included into the modelling process. Any occurence outside the range of Pakistan was discarded as well.

3.2 Environmental Dataset

These are descriptors of the environment, and can include abiotic measurements such as temperature or precipitation, or biotic factors like the presence/absence of other animals. We will focus on abiotic factors available from WorldClim. After you downloaded the raster files, you need to crop them to the extent of Pakistan. This step is not part of this tutorial, because this was already done beforehand. You can do this with the function crop(predictors, extent(Pakistan)) and store it in a new variable.

Butterflies have been recognized as being highly sensitive to climate and being affected by the amount of energy available. The ectothermic behavior of butterflies leads to a high dependency on warm temperature and solar radiation; their flight muscles need to maintain a temperature above ambient in order to work efficently (Turner et al. 1987). Wind speed is essential for every flying animal. According to Professor Robert Dudley, head of the Animal Flight Laboratory at U.C. Berkeley, an animal cannot fly if the wind speed is higher than the maximum flight speed. Precipitation as well as water vapor seem to have an important effect on species richness as well (Checa et al. 2019).

Table 1: The environmental variables and their units.
Predictor_name Unit
prec precipitation in mm
temp_min temperature in °C
temp_max temperature in °C
temp_avg temperature in °C
srad solar radiation in kJ m-2 day-1
vapr water vapor pressure in kPa
wind wind speed in m s-1

4. Preprocessing input data

This tutorial consists of the following work steps:

  • obtaining and preprocessing of all input data (species presence data and data for environmental predictor variables)
  • creation of the model from the input data
  • using the model to create a prediction based on the species occurrence points and the predictor variables
  • thresholding and creating species richness map
  • evaluating the performance of the BIOCLIM model

Set your working directory. You can check your current working directory with the function getwd():

4.1 Loading required packages

Load the required packages with install.packages() and activate them with library(). The BIOCLIM function is implemented in the dismo package (Hijmans et al. 2020). You can learn more about that package here: https://cran.r-project.org/web/packages/dismo/dismo.pdf.

4.2 Environmental data

We are loading a vector map showing the country boundary of Pakistan and the environmental variables.

For the next step you need the package “USDM”. We are using this function to exclude highly correlated data in our predictor dataset. This is an important step when you want to model a prediction. If you are interested in learning more about this, you can start with this article: https://towardsdatascience.com/why-exclude-highly-correlated-features-when-building-regression-model-34d77a90ea8e.

The Variance Inflation Factor excluded average, minimum and maximum temperature (temp_avg, temp_min, temp_max) and precipitation, solar radiation, water vapor and wind speed remained:

## 3 variables from the 7 input variables have collinearity problem: 
##  
## temp_avg temp_min temp_max 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( wind ~ prec ):  -0.03884336 
## max correlation ( vapr ~ srad ):  0.7153276 
## 
## ---------- VIFs of the remained variables -------- 
##   Variables      VIF
## 1      prec 1.427669
## 2      srad 2.823217
## 3      vapr 2.792292
## 4      wind 1.362512

Interpreting the Variance Inflation Factor:

  • 1 = not correlated

  • between 1 and 5 = moderately correlated

  • greater than 5 = highly correlated

Now we exclude the collinear variables that were identified in the previous step:

Our remaining predictors are:

4.3 Get the butterflies going

We are loading the important species presence points (BIOCLIM is a presence-only method! If you wanna know more about presence/absence/pseudo-absence/background-data you can start with reading the tutorial of Damaris Zurell: https://damariszurell.github.io/EEC-SDM/5_pseudoabsence.html) and prepare them for further steps.

Next step is to turn the data frame (table-like data) into spatial points. We need this for our bioclim()-function.

For the later evaluation of the model, we need to divide the occurence data into training and test data. In our case, we will randomly withhold 30% of the observations as test data and retain the other 70% as training data. We will fit the model with the training data and evaluate it with the test data.

First plotting:

5. Species distribution modelling with Bioclim

We are ready to fit the model. We simply need the predictors and the occurrence points; the model does the extraction.

How does the likelihood of species occurrence responds to variation in these climatic variables?

The response plots show, how the probability of the species occurring (y-axes, from zero to one) varies with each of the environmental predictors (x-axes).

So what does this model predict about the distribution of favorable habitats across Pakistan?
For the predict model (also from the dismo package) you only need the raster stack with our predictors and the BIOCLIM model.

So let´s see how it looks:

This is now the prediction map and its species distribution prediction based on the occurence data and the selected climate variables. In the BIOCLIM model the predicted values range from 0 to 1, whereas 1 indicates a location with a median value of all covariates, and 0 locations where at least one covariate is outside the range of environmental covariates at presence location.

As you can see, the habitat is predicted to be good in areas where there are lots of species occurrences. This is not too surprising, since the data points were used to fit the model. However, note that there are also large expanses of habitat that is predicted to be less good, but where observations occur:

6. Species Richness Map

As a final step we are going to create a species richness map of butterflies in Pakistan indicating the number of species per pixel. We therefore need to decide on a threshold for a species presence/absence map. Once all species are converted to a binary (presence/absence), we have to stack all the rasters to get a richness map.

The following code is adopted from Cecina Babich Morrow. Here she is explaining how to achieve a threshold (either minimum training presence (MTP) or 10th percentile training present (P10)).

This is the base loop function:

Now testing the function for one species (e.g. Acraea issoria):

Now we are creating a loop, that does the calculation for every single species. A loop is a way to repeat a sequence of instructions automatically. The basic format looks similar to this:

for(i in X){ … }

where i keeps track of the index, X is the set of indicies to use, and … is where you place the code to be run for each.


Quick remark:
R stores objects in virtual memory and there are limits based on the amount of memory that can be used by all objects. If you get the error message: “cannot allocate vector of size…” you can try this:


So let´s loop…

Create raster stack from presence/absence data; each species is one layer.

Our species richness map containing the sum of the occurring species for each cell:

And here for comparison the richness map again but with the occurrence points:

7. Evaluation of the model performance

How well does the model actually do at predicting the occurrence of butterflies in our study area?

To evaluate the predictive accuracy of the model, we turn back to our test data, and use cross-validation to test the model. For that, we use an approach called the Area Under the Receiver Operator Curve (AUC). If you are not familiar with Receiver Operator Curve (ROC) and AUC, you can read something about it here.

In Figure 2 you can see a scale for interpreting the ROC plot and the AUC value:

Figure 2: Interpretation of the AUC value. Source: Biodiversity and Climate Change Virtual Laboratory (BCCVL) 2021.

Figure 2: Interpretation of the AUC value. Source: Biodiversity and Climate Change Virtual Laboratory (BCCVL) 2021.

The AUC procedure can be adapted for presence only data by generating random “pseudo-absence”-points within the region. So we first generate background points for the pseudo-absence-points:

Then we can use “evaluate()” to generate several diagnostics as well as the AUC, using our test data as our presences against our pseudoabsences.

## class          : ModelEvaluation 
## n presences    : 1522 
## n absences     : 1000 
## AUC            : 0.752112 
## cor            : 0.3819793 
## max TPR+TNR at : 0.1057516

This is what the ROC (and AUC) looks like as a plot:


According to Figure 2, our model appears to do a little bit better than random guessing with an AUC of 0.752112 (but not more). If you are looking at other studies about model performance in SDM, you will see that BIOCLIM performs most of the time rather poorly compared to other algorithms (Babar et al. 2012; Hernandez et al. 2006; Tsoar et al. 2007). It is still a solid algorithm for species distribution modelling and good for teaching purposes because of its simple structure.

8. Summary

  • implemented in the package “dismo” (Hijmans et al. 2020)
  • a profile method (envelope-style)
  • presence-only (no absence or backround data needed)
  • defines a multi-dimensional environmental space in which a species can occur
Figure 3: The bioclim()-function. Source: Hijmans et al. 2020.

Figure 3: The bioclim()-function. Source: Hijmans et al. 2020.

Cons:

  • it does not perform as good as other algorithms
  • assumes that species distribution is defined by environmental factors alone

To get an overview of the work steps that were done in this tutorial, here the final script:

# set work directory
#-------------------

setwd("C:/Users/Mona Hallenberger/Uni/Projekt_SDM_Pakistan")
filepath <- ("C:/Users/Mona Hallenberger/Uni/Projekt_SDM_Pakistan")

# setting the Data folder
data <- paste0(filepath, "/Data/")

# setting the Output folder
output <- paste0(filepath, "/Output/")
getwd()

# loading required packages
#--------------------------

#install.packages('dismo')
library(dismo)        # includes raster
library(maptools)     # includes sp
library(rgdal)        #for shapefiles
library(usdm)

# load supplementary and prediction data
#---------------------------------------

# vector map showing the country boundary of Pakistan
pakistan <- getData("GADM", country = "Pakistan", level = 0)

# predictors data Temp (max, min), Prec, wind (raster)
env_files <- list.files(path = paste0(data,"environ_variables"), full.names=TRUE)
preds <- stack(env_files)

#detects variables with high correlation based on calculating variance inflation factor (VIF) statistics
v1 <- vifstep(preds)
v1

#exclude the collinear variables that were identified in the previous step
predictors <- exclude(preds, v1) 

################ get the butterflies going #####################
#---------------------------------------------------------------

# load butterfly dataset
#-----------------------
butterflies_csv <- read.csv(file="C:/Users/Mona Hallenberger/Uni/Projekt_SDM_Pakistan/Data/Butterfly_Data.csv", header=TRUE, sep=',')

#copy the data to new object, because it's needed later on
butterflies <- butterflies_csv

butterflies$species <- as.factor(butterflies$species)

#needed for loop later
list_butterflies <- levels(butterflies$species) 

# get coordinates to dataset
#---------------------------

#this step turns the data frame (table-like data) into spatial points
coordinates(butterflies) <- c("x","y")
points <- crop(butterflies, pakistan)

# split into training and test data set
#--------------------------------------
#caret is a strong machine learning library including important functions for data splitting
library(caret)

set.seed(123)  

#split data while keeping original distribution per variable
points_df <- as.data.frame(points)
train_id <- createDataPartition(points$species, times = 1, p = 0.7, list = FALSE)

sp_train <- points_df[train_id,]
sp_test <- points_df[-train_id,]

# first plotting
#---------------
plot(predictors, 1, xlab = "Latitude", ylab = "Longitude", main="Precipitation", col = terrain.colors(50)) #plot 1st predictor
plot(pakistan, add = TRUE, border="dark grey")
points(points, pch=19, cex=0.1, col="black")

################# set up BioClim model #######################
#-------------------------------------------------------------

#model fitting
bc <- bioclim(predictors, sp_train[,-1])
plot(bc,a=1, b=2, p=0.85)

#response curves
response(bc)

#prediction
pb <- predict(predictors, bc)

### plot prediction map ###
#--------------------------
plot(pb, main= "Butterfly Prediction Map",col = terrain.colors(50))
plot(pakistan, add = TRUE, border="dark grey")

########### species richness map ################

##### thresholding for one spezies ####
#--------------------------------------------------------
#computes the probability for the appearance of one species into a binary presence/abscence map; this is just the base function that is applied further down

sdm_threshold <- function(sdm, occs, type = "mtp", binary = FALSE){
  
  # parameter
  # ---------
  # sdm = prediction from model (BioClim)
  # occs = subset species
  # type = type of threshold; either minimum training presence (mtp) or 10th percentile training presence
  # binary = logical; binary or not
  #----------
  
  occPredVals <- raster::extract(sdm, occs)   #extract species from model
  
  
  if(type == "mtp"){
    thresh <- min(na.omit(occPredVals))
   
   
  } else if(type == "p10"){
    if(length(occPredVals) < 10){
      p10 <- floor(length(occPredVals) * 0.9)
    } else {
      p10 <- ceiling(length(occPredVals) * 0.9)
    }
    thresh <- rev(sort(occPredVals))[p10]  #sort in reverse
  }
  
  #set pixels smaller than found threshold to NA
  sdm_thresh <- sdm
  sdm_thresh[sdm_thresh < thresh] <- NA
  
  #make it binary by assigning 1
  if(binary){
    sdm_thresh[sdm_thresh >= thresh] <- 1
  }
  return(sdm_thresh)
}

####### test function for one species ######
#-------------------------------------------
#copy prediction results to new object
butterfly_pred <- pb
butterfly_occs <- subset(butterflies,butterflies$species == 'Acraea_issoria') #e.g.
butterfly_mtp <- sdm_threshold(butterfly_pred, as.data.frame(butterfly_occs)[,2:3], "p10")
#binary=TRUE --> presence/absence map

plot(butterfly_mtp)

################## loop ######################
#---------------------------------------------
#the function above is implemented in a loop in order to generate a presence/absence map for each species; at the end everything´s going to get stacked and summed up for each cell

#create new and empty list objects to store output in
list_results=list()
list_binary=list()

for (i in list_butterflies) {
  
  memory.limit(9999999999)
  
  #subset the data by species 
  data_butterfly <- subset(butterflies_csv,butterflies_csv$species == i)
  
  #keep just column 2 and 3
  data_butterfly <- data_butterfly[,2:3]
  
  #store prediction results in new
  butterfly_pred <- pb
  
  #again subset by species; now i represents one species for each loop run
  butterfly_occs <- subset(butterflies_csv,butterflies_csv$species == i)
  
  #apply thresholding now using minimum training presence -> "mtp" to "p10" if you want 10th percentile 
  butterfly_mtp <- sdm_threshold(butterfly_pred, as.data.frame(butterfly_occs)[,2:3], "mtp", binary=TRUE)
  
  #store result in list object; [i] indicates the index in the list to store the data in
  list_binary[i]<- butterfly_mtp
  
  #same, but wuith prediction results
  list_results[i]<-pb
  
}

#create raster stack from presence/absence data; each species is one layer
butterfly_stack <- stack(list_binary)

#creates sum for each pixel of the stack
results <- stackApply(butterfly_stack,indices=c(1),fun=sum, na.rm=TRUE)

#plot final species richness map
plot(results,main="Species Richness Map")
plot(pakistan, add = TRUE, border="dark grey")


########### evaluation of model performance ##############
#---------------------------------------------------------

# validating the model with AUC
bg <- randomPoints(predictors, 1000)

#evaluation --> AUC
eva <- evaluate(bc,p=sp_test, a=bg, x=predictors)

plot(eva, 'ROC')

References

Babar, S., Amarnath, G., Reddy, C.S., Jentsch, A. & S. Sudhakar (2012): Species distribution models: ecological explanation and prediction of an endemic and endangered plant species (Pterocarpus santalinus L.f.). Current Science vol. 102(8): 1157-1165.

Beaumont, L. J., Hughes, L., & M. Poulsen (2005): Predicting species distributions: use of climatic parameters in BIOCLIM and its impact on predictions of species’ current and future distributions. Ecological modelling 186(2): 251-270.

Biodiversity and Climate Change Virtual Laboratory (BCCVL) (2021):SDM-Interpretation of model outputs.
https://support.bccvl.org.au/support/solutions/articles/6000127046-sdm-interpretation-of-model-outputs

Biodiversity and Climate Change Virtual Laboratory (BCCVL) (2016): Algorithm Information (SDMs): BIOCLIM.
https://support.bccvl.org.au/support/solutions/articles/6000083201-bioclim

Booth, T.H., Nix, H.A., Busby J.R. and M.F. Hutchinson (2014): BIOCLIM: the first species distribution modelling package, its early applications and relevance to most current MAXENT studies. Diversity and Distributions 20: 1-9.

Booth, T.H. (2018): Why understanding the pioneering and continuing contributions of BIOCLIM to species distribution modelling is important. Austral Ecology 43: 852–860.
https://onlinelibrary.wiley.com/doi/epdf/10.1111/aec.12628

Busby, J.R. (1991): Bioclim – a bioclimatic analysis and prediction system. Plant Protection Quarterly 6: 8–9.

Checa, M.F., Levy, E., Rodriguez, J. & K. Willmott (2019): Rainfall as a significant contributing factor to butterfly seasonality along a climatic gradient in the neotropics. bioRxis. The reprint server for biology.
https://doi.org/10.1101/630947

Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R.J., Huettmann, F., Leathwick, J., Lehmann, A., Li, J., Lohmann, L.G., Loiselle, B., Manion, G., Moritz, C., Nakamura, M., Nakazawa, M., Overton, J.McC, Peterson, A.T., Phillips, S., Richardson, K., Scachetti-Pereira, R., Schapire, R., Soberon, J., Williams, S., Wisz, M. & N. Zimmerman (2006): Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29: 129-151.
https://dx.doi.org/10.1111/j.2006.0906-7590.04596.x

Hijmans R.J., & C.H. Graham (2006): Testing the ability of climate envelope models to predict the effect of climate change on species distributions. Global change biology 12: 2272-2281.
https://dx.doi.org/10.1111/j.1365-2486.2006.01256.x

Hernandez, P.A., Graham, C.H., Master, L.L. & D.L. Albert (2006): The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography 29: 773-785.
https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.0906-7590.2006.04700.x

Nix, H.A. (1986): A biogeographic analysis of Australian elapid snakes. In: Atlas of Elapid Snakes of Australia. Australian Flora and Fauna Series 7: 4–15. Canberra.

Tsoar, A., Allouche, O., Steinitz, O., Rotem, D. & R. Kadmon (2007): A comparative evaluation of presenceonly methods for modelling species distribution. Diversity and Distributions 13: 397–405.
https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1472-4642.2007.00346.x

Turner, J.R.G., Gatehouse, C.M. & C.A. Corey (1987): Does Solar Energy Control Organic Diversity? Butterflies, Moths and the British Climate. Oikos 48 (2): 195-205.

R packages

Bivand, R. & N. Lewin-Koh (2021). maptools: Tools for Handling Spatial Objects. R package version 1.1-1.
https://CRAN.R-project.org/package=maptools

Bivand, R., Keitt, T. & B. Rowlingson (2021). rgdal: Bindings for the ‘Geospatial’ Data Abstraction Library. R package version 1.5-23.
https://CRAN.R-project.org/package=rgdal

Hijmans, R.J., Phillips, S., Leathwick, J. & J. Elith (2020). dismo: Species Distribution Modeling. R package version 1.3-3.
https://CRAN.R-project.org/package=dismo

Kuhn, M. (2021). caret: Classification and Regression Training. R package version 6.0-88.
https://CRAN.R-project.org/package=caret

Naimi B., Hamm N.A.S., Groen T.A., Skidmore A.K. & A.G. Toxopeus (2014): Where is positional uncertainty a problem for species distribution modelling. Ecography 37: 191-203.
https://doi.org/10.1111/j.1600-0587.2013.00205.x

Session Info

R environment session info for reproducibility of results:

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2 caret_6.0-88       ggplot2_3.3.3      lattice_0.20-41   
##  [5] usdm_1.1-18        rgdal_1.5-23       maptools_1.1-1     dismo_1.3-3       
##  [9] raster_3.4-5       sp_1.4-5           knitr_1.31         png_0.1-7         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6           lubridate_1.7.10     class_7.3-18        
##  [4] assertthat_0.2.1     digest_0.6.27        ipred_0.9-11        
##  [7] foreach_1.5.1        utf8_1.2.1           R6_2.5.0            
## [10] plyr_1.8.6           stats4_4.0.5         evaluate_0.14       
## [13] highr_0.8            pillar_1.5.1         rlang_0.4.10        
## [16] data.table_1.14.0    jquerylib_0.1.4      rpart_4.1-15        
## [19] Matrix_1.3-2         rmarkdown_2.7        splines_4.0.5       
## [22] gower_0.2.2          stringr_1.4.0        foreign_0.8-81      
## [25] munsell_0.5.0        compiler_4.0.5       xfun_0.22           
## [28] pkgconfig_2.0.3      rgeos_0.5-5          htmltools_0.5.1.1   
## [31] nnet_7.3-15          tidyselect_1.1.0     prodlim_2019.11.13  
## [34] tibble_3.1.0         bookdown_0.22        codetools_0.2-18    
## [37] fansi_0.4.2          crayon_1.4.1         dplyr_1.0.5         
## [40] withr_2.4.1          ModelMetrics_1.2.2.2 MASS_7.3-53.1       
## [43] recipes_0.1.16       grid_4.0.5           nlme_3.1-152        
## [46] jsonlite_1.7.2       gtable_0.3.0         lifecycle_1.0.0     
## [49] DBI_1.1.1            magrittr_2.0.1       pROC_1.17.0.1       
## [52] scales_1.1.1         rmdformats_1.0.2     stringi_1.5.3       
## [55] reshape2_1.4.4       timeDate_3043.102    bslib_0.2.5.1       
## [58] ellipsis_0.3.1       generics_0.1.0       vctrs_0.3.7         
## [61] lava_1.6.9           iterators_1.0.13     tools_4.0.5         
## [64] glue_1.4.2           purrr_0.3.4          survival_3.2-10     
## [67] yaml_2.2.1           colorspace_2.0-0     sass_0.4.0