Species distribution modelling of butterflies in Pakistan

Modelling using a Neural Network

Introduction

General

In this tutorial, the distribution of multiple butterfly species across Pakistan will be modelled. Since the distribution of a species goes hand in hand with the prevailing environmental factors, such as wind and precipitation, these influences will be taken into account and included in the calculations. The method used is a neural network.

What is a Neural Network?

The neural network method is inspired by the human brain and mimics the way biological neurons signal to one another. Since this method belongs to machine learning, many systems, e.g. search engines such as Google, rely on its way of functioning, making it a widespread method in today's world.

Neural networks are a subset of machine learning and are furthermore classified as a deep learning method. Both are subcategories of artificial intelligence. As the name machine learning indicates, the method can be trained by running through pre-tests. Once trained, the machine can generalize and apply this knowledge to new data. The main difference between deep learning and classical machine learning lies in the way information is processed: classical machine learning needs structured data, whereas a neural network can process huge, partly unstructured datasets, e.g. it can transform pictures into numeric data. The more data that is provided, the more accurate the result will be.

Artificial neural networks (ANNs) in particular are interesting for multiple applications, e.g. face and voice recognition.

Here is an example:

A human brain recognizes a person although their appearance changes constantly, even if it only sees a small part of the face, like the eyes, a situation we encountered a lot in the last two years. A machine, on the other hand, usually needs the exact same information (photo). With a neural network we can train a machine to recognize a person across different movements and appearances, just as the human brain does.

Source: Wuttke, L. (n.d.), see references.

The development of these ideas started in 1943, when Warren McCulloch and Walter Pitts connected neurons into something resembling a net. Ever since, the development of neural networks has kept going, and since 2009 they have been widely used, especially for demanding tasks involving huge datasets. Known for solving real-world problems efficiently, neural networks are today one of the most extensively researched areas in computer science.

How does a Neural Network work?

There are various kinds of neural networks, but the basic structure stays the same: it consists of multiple layers, of which we can only see two, the input layer and the output layer. Between these two layers are further layers, which are invisible and thus called hidden layers. These layers are essential for the functioning of the method. Each layer consists of several nodes; each node is connected to nodes in the neighbouring layers and has an associated weight and threshold.

Source: IBM (2020), see references.

Some mathematics

Source: Ibrahim, A. et al. (2016), see references.

Before starting, the impact of each input is taken into consideration in order to assign a weight to the variable. All inputs are then multiplied by their respective weights and the products are summed. Afterwards, the result is passed through an activation function, which limits the amplitude of the output to a finite value. The nodes of a given layer send data to the next layer only if their output is above the threshold value; otherwise no data is transmitted.

\[ \Large output = f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} w_i x_i \geq 0 \\ 0 & \text{if } \sum_{i=1}^{m} w_i x_i < 0 \end{cases} \]
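To make this concrete, here is a minimal R sketch of a single node with a step activation; the inputs and weights are made-up example values:

## A single artificial neuron with a step activation.
## The inputs and weights are made-up example values.
x <- c(0.2, 0.7, 0.1)      # input values
w <- c(0.4, -0.3, 0.9)     # associated weights

weighted_sum <- sum(w * x) # multiply each input by its weight and sum up

## step activation: the node "fires" (1) only if the sum reaches the threshold (here 0)
output <- ifelse(weighted_sum >= 0, 1, 0)
output
## [1] 0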

Some exemplary activation functions:

Figure source: Rahul Jayawardana & Sameera Bandaranayake (2021), see references.
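The referenced figure is not reproduced here; as a stand-in, a few common activation functions can be plotted directly in R (our own illustration, not taken from the cited source):

## Plot three common activation functions for comparison
z <- seq(-5, 5, by = 0.1)
sigmoid <- 1 / (1 + exp(-z))
relu <- pmax(0, z)
plot(z, sigmoid, type = "l", col = "blue", ylim = c(-1.1, 2), xlab = "input", ylab = "output", main = "Exemplary activation functions")
lines(z, tanh(z), col = "red")
lines(z, relu, col = "darkgreen")
legend("topleft", legend = c("sigmoid", "tanh", "ReLU"), col = c("blue", "red", "darkgreen"), lty = 1)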

Modelling species richness

Used Packages

First of all, we need to load several libraries, especially the SSDM package, because we are going to use it to model our data.

library(SSDM)
library(raster) 
library(data.table) 
library(jpeg)
library(plyr)

During the workflow we need the path to our working directory several times. To make this step easier, we will store this path in the wd variable.

setwd("C:\\Users\\Hannes\\OneDrive\\4. Semester\\Species Distribution Modelling\\SDM_MaHa\\MaHa_Arbeitsordner")
wd <- getwd() ## setwd() returns the previous directory, so we query the new one with getwd()

Read in data

Read in environment data

As mentioned above, we are utilizing the SSDM package. This package brings along multiple functions that take over some data preparation steps, executing them in the background.

To begin with, we load some environmental data. Since we are part of the SDM course, we received data specific to our study area, Pakistan, so we do not have to crop it. The data is available as GeoTIFF. We chose precipitation, solar radiation, average temperature and wind speed in order to depict the living conditions of the species we are going to analyse later on. Additionally, we download elevation data and store it together with the other environmental data. You can download the raw data here. Otherwise you can use e.g. the getData function from the raster package.

elev <- getData("alt", country="PAK", path = file.path(wd, "data\\Environment\\"))
predictors <- load_var(file.path(wd, "data\\Environment\\"))
## Variables loading 
## Variables treatment 
##    resolution and extent adaptation...   done... 
##    normalizing continuous variable

The special thing about the load_var function is that it can work with data which is not necessarily provided in the same data type. In our case the data comes in two different formats (TIF and GRI). Additionally, the function adapts the resolution and the extent of the data and normalizes its continuous variables.
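For illustration, these treatment steps could be approximated manually with basic raster functions. The following is a rough sketch, assuming a hypothetical input file wind.tif; load_var's actual internals may differ:

## A rough manual approximation of the load_var treatment steps
## (illustrative only; the file name "wind.tif" is an assumption)
ref <- predictors[[1]]                                  # reference grid
r <- raster(file.path(wd, "data\\Environment\\wind.tif"))
r <- resample(r, ref)                                   # adapt resolution and extent
r <- (r - cellStats(r, "min")) / (cellStats(r, "max") - cellStats(r, "min")) # normalize to [0, 1]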

names(predictors)[1] <- c("elev")
predictors
## class      : RasterStack 
## dimensions : 1602, 1943, 3112686, 5  (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 60.85, 77.04167, 23.7, 37.05  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +ellps=WGS84 +no_defs 
## names      :         elev,         prec,         srad,     temp_avg,         wind 
## min values : -0.001820830,  0.009259259,  0.338035049, -2.090909234,  0.048192771 
## max values :            1,            1,            1,            1,            1

Let us check the predictors.

plot(predictors)

The function stacks the data into a RasterStack. Furthermore, the data is presented in the WGS84 coordinate system.

Load occurrence data

Our subjects of interest are various butterfly species in Pakistan and their distribution. In the following, the occurrence data of these species is read in from a CSV file using the load_occ function from the SSDM package. The data was provided by our course instructor. We hand the environmental data over to the function because it automatically removes all occurrences that are not within the extent of the RasterStack.

occ <- load_occ(file.path(wd, "data"), Env = predictors, Xcol = "x", Ycol = "y", Spcol = "species", file = "PakistanLadakh.csv", sep = ",", dec = ".")
head(occ)

In order to facilitate the readability of the given DataFrame, we change the names of the columns.

names(occ) <- c("species", "lon", "lat")

To get a better overview of our data, we look at the individual species and how often each has been spotted.

spec_names <- unique(occ$species)
cat("There are", format(length(spec_names)), "different species. Overall there are", format(length(occ$species)), "entries, so that we have an average of", format(length(occ$species)/length(spec_names)), "entries per species.")
## There are 419 different species. Overall there are 5725 entries, so that we have an average of 13.66348 entries per species.
bd <- count(occ$species)
hist(bd$freq, main = "Frequency of occurrence of butterflies", xlab = "Frequency", ylab = "Number of species", cex.lab=1.2, cex.main=1.5, col="cadetblue2")
abline(v = 15, col = "black", lwd = 3, lty = 2)

As you can see in the histogram above, there are a lot of species that have not been spotted often, and it does not make much sense to model all of them. Thus, our modelling is limited to the species that have been spotted at least 15 times. The dashed line marks the border.

We build a small loop to extract the species which were spotted at least 15 times.

species_list <- list() ## create an empty list

for (a in spec_names) { ## loop over each species
  if (length(occ[occ$species == a, 1]) >= 15) { ## if the species occurs at least 15 times
    species_list[[a]] <- occ[occ$species == a, ] ## then save it into our list
  }
}

species <- names(species_list) ## save the different species names

occ <- rbindlist(species_list) ## overwrite the original occurrence data with the filtered subset
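For comparison, the same filtering could also be expressed without a loop; a vectorized sketch, assuming it is run on the unfiltered occurrence data:

## Equivalent vectorized filtering (alternative to the loop above)
counts <- table(occ$species)
keep <- names(counts)[counts >= 15]
occ_filtered <- occ[occ$species %in% keep, ]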

We check how many species are left.

length(unique(occ$species))  
## [1] 144

As we can see, only 144 species have been spotted at least 15 times.

Get extra data - administrative border of Pakistan

border <- getData("GADM", country="Pakistan",level=0, path = file.path(wd, "data/"))

A SpatialPointsDataFrame allows us to plot the occurrence data on a map. The data is assigned the same CRS as our environmental maps (WGS84).

occ_spdf <- SpatialPointsDataFrame(cbind(occ$lon, occ$lat), occ, proj4string = CRS("+init=EPSG:4326"))

Now we create the map with our occurrence points.

plot(predictors$elev, main = "Species Records\n in Pakistan")
plot(border, add=TRUE)
points(occ_spdf, pch="+", cex=0.3, col="black")

First Model

Now we are ready to try modelling our data. For the first model we use just one single species in order to experiment a little and get an overview of how the method works. The SSDM package provides its own modelling function, which only needs the method, the occurrence data, the environmental data and the columns for the coordinates. The algorithm we use is ANN, so the function uses the nnet package in the background. Hence, we use a neural network for our model.

Colias erate occurs most often in our data.

Figure source: BioLib.cz (n.d.), see references.

sdm_test <- modelling("ANN", subset(occ, occ$species == species[1]), Env = predictors, Xcol = "lon", Ycol = "lat")

Since we did not construct absence points, the function creates random pseudo-absence points for us. Because these random points are not created systematically, the model delivers a different output every time we run it. In order to guarantee reproducibility, we will create our own absence points later. But first, let us look at the output of the first model.
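As a side note, fixing the random number generator seed before the modelling call would be another way to obtain reproducible results; this is our own suggestion, not part of the SSDM workflow shown here:

## Suggestion (not part of this workflow): fixing the RNG seed
## would make the random pseudo-absences, and thus the output, reproducible
set.seed(42)
sdm_test_seeded <- modelling("ANN", subset(occ, occ$species == species[1]), Env = predictors, Xcol = "lon", Ycol = "lat")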

plot(sdm_test@projection, main = "Distribution of our first species\n without own random points")
plot(border, add=TRUE)

Additionally, we can check the importance of our environmental variables. As we can see, solar radiation is the most important one; the rest have similar importance.

sdm_test@variable.importance
##                     elev     prec     srad temp_avg     wind
## Axes.evaluation 14.79386 16.93978 39.85013 15.41271 13.00351

Before we can create our absence points, our occurrence DataFrame needs an extra column for the presence/absence information.

occ$presence <- 1

Create Absence Points

Now we can create the absence points.

## create a SPDF - needs a Raster object (here: predictors) - benefit: the absence points have the same extent as our environmental data.
## For each presence point we create one absence point
absence_points <- sampleRandom(x=predictors, length(occ$species), sp=T) 
## transform our points into a DataFrame
absence_points <- raster::as.data.frame(absence_points)
## add a column for the species
absence_points$species <- occ$species
## as above, add the column for the presence/absence information (here 0 for absence)
absence_points$presence <- 0
## remove the five extracted predictor columns, which we do not need
## (the original -c(1:4) would have kept the wind column by mistake)
absence_points <- absence_points[-c(1:5)]
## order the columns in the same way as in our occ df
absence_points <- absence_points[,c(3,1,2,4)]
## use the same column names as in our occ df
names(absence_points) <- c("species", "lon", "lat", "presence")

We create our absence points only once and save them as a .csv file, so that they will not be generated again. Afterwards we read the DataFrame back in.

write.csv(absence_points, file.path(wd, "output/absence_points.csv"), row.names=FALSE) 
absence_points <- read.csv(file.path(wd, "output/absence_points.csv"))
head(absence_points)
##                species      lon      lat presence
## 1 Aglais_caschmirensis 65.81250 27.09583        0
## 2 Aglais_caschmirensis 74.26250 35.93750        0
## 3 Aglais_caschmirensis 62.87083 27.67917        0
## 4 Aglais_caschmirensis 68.72917 24.62083        0
## 5 Aglais_caschmirensis 68.17917 28.23750        0
## 6 Aglais_caschmirensis 68.89583 31.02917        0
head(occ)
##                 species      lon      lat presence
## 1: Aglais_caschmirensis 71.13964 33.82069        1
## 2: Aglais_caschmirensis 71.62838 35.92580        1
## 3: Aglais_caschmirensis 71.69917 35.81516        1
## 4: Aglais_caschmirensis 71.70334 35.87940        1
## 5: Aglais_caschmirensis 71.70810 35.98351        1
## 6: Aglais_caschmirensis 71.77320 36.15080        1

Combine presence and absence Points

Now we create a new DataFrame which combines the DataFrames we prepared before.

final_data <- rbind(occ, absence_points)
final_data <- final_data[order(final_data$species),]

Second Model

Now we can try the model from above with our optimized data. We just add the argument Pcol, which hands over the information whether the points are presence or absence points.

sdm_test_absence <- modelling("ANN", subset(final_data, final_data$species == species[1]), Env = predictors, Xcol = "lon", Ycol = "lat", Pcol = "presence")
plot(sdm_test_absence@projection, main = "Distribution of our first species\n with own random points")
plot(border, add=TRUE)

Function by Cecina Babich Morrow

We are close to modelling all species in a loop. But before we do, we want to implement a threshold function.

The function thresholds our model using a given threshold, either the minimum training presence method or the 10th percentile method. When thresholding with the minimum training presence (MTP), the lowest predicted suitability value across the occurrence points is identified, i.e. the least suitable habitat at which the species is known to occur (taken from the given data), and this value is assumed to be the minimum suitability value for the species. Alternatively, the 10th percentile training presence (P10) can be applied. That threshold omits all regions with suitability values lower than those of the lowest 10% of the occurrence records. Thus, the P10 threshold omits a greater region than the MTP threshold.

For more information about the function you can click here

sdm_threshold <- function(sdm, occs, type = "mtp", binary = FALSE){
  ## extract the predicted suitability values at the occurrence points
  occPredVals <- raster::extract(sdm, occs)
  if(type == "mtp"){
    ## minimum training presence: lowest suitability at any occurrence
    thresh <- min(na.omit(occPredVals))
  } else if(type == "p10"){
    ## 10th percentile training presence: index of the suitability value
    ## that 90% of the occurrences reach or exceed
    if(length(occPredVals) < 10){
      p10 <- floor(length(occPredVals) * 0.9)
    } else {
      p10 <- ceiling(length(occPredVals) * 0.9)
    }
    thresh <- rev(sort(occPredVals))[p10]
  }
  ## mask out all cells below the threshold
  sdm_thresh <- sdm
  sdm_thresh[sdm_thresh < thresh] <- NA
  if(binary){
    ## optionally convert the remaining cells to a binary (1) map
    sdm_thresh[sdm_thresh >= thresh] <- 1
  }
  return(sdm_thresh)
}
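To illustrate, this is how the function could be applied to our first test model, comparing both threshold types (hypothetical usage, not part of the main workflow):

## Hypothetical usage on our first test model
occ_first <- subset(occ, occ$species == species[1])
mtp_map <- sdm_threshold(sdm_test@projection, occ_first[, 2:3], type = "mtp")
p10_map <- sdm_threshold(sdm_test@projection, occ_first[, 2:3], type = "p10", binary = TRUE)
plot(p10_map, main = "P10-thresholded test model")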

Let's model all species

First we create an empty list which we need in the next step.

Binary <- list()

Now we are going to build a small loop and model each species.

for (a in species) {
  data <- modelling("ANN", subset(final_data, final_data$species == a), Env = predictors, Xcol = "lon", Ycol = "lat", Pcol = "presence")
  
  ## save the resulting projection
  sdm <- data@projection
  
  occ_final <- subset(occ, occ$species == a)
  ## apply the threshold function
  distribution <- sdm_threshold(sdm, occ_final[,2:3], "p10", binary = TRUE)
  ## store the result in the list ([[ ]] keeps the raster object intact)
  Binary[[a]] <- distribution
}
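Depending on the data, a single species model can occasionally fail. As an optional robustness variant (our own suggestion, not part of the original workflow), the modelling call could be wrapped in tryCatch so the loop skips such species instead of aborting:

## Optional robustness sketch: skip species whose model fails
for (a in species) {
  model <- tryCatch(
    modelling("ANN", subset(final_data, final_data$species == a), Env = predictors, Xcol = "lon", Ycol = "lat", Pcol = "presence"),
    error = function(e) NULL
  )
  if (is.null(model)) next # species skipped on error
  occ_final <- subset(occ, occ$species == a)
  Binary[[a]] <- sdm_threshold(model@projection, occ_final[, 2:3], "p10", binary = TRUE)
}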

We stack all resulting layers.

Binary_stack <- stack(Binary)
species_richness <- stackApply(Binary_stack, indices = c(1), fun = sum, na.rm = TRUE)
plot(species_richness, main="Species richness of butterflies in Pakistan", ylab="Latitude", xlab="Longitude")
plot(border, add=T)

References

BioLib.cz (n.d.): http://BioLib.cz (Access: 29.06.2022)

Dongare, A. D., Kharde, R. R. & Amit D. Kachare (2012): Introduction to Artificial Neural Network. International Journal of Engineering and Innovative Technology (IJEIT) 2/1: 189-194.

IBM (2020): Neural Networks: https://www.ibm.com/de-de/cloud/learn/neural-networks (Access: 23.06.2022)

Ibrahim, A. et al. (2016): Control Systems in Robotics: A Review. International Journal of Engineering Inventions 5/5: 29-38.

Morrow, C. B. (2019): Thresholding species distribution models: https://babichmorrowc.github.io/post/2019-04-12-sdm-threshold/ (Access: 20.06.2022)

Rahul Jayawardana, J. K. & T. Sameera Bandaranayake (2021): Analysis of optimizing Neural Networks and artificial intelligent models for guidance, control and navigation systems. International Research Journal of Modernization in Engineering, Technology and Science 03/03: 743-759.

Wuttke, L. (n.d.): Deep Learning: Definition, Beispiele & Frameworks: https://datasolut.com/was-ist-deep-learning/ (Access: 29.06.2022)

Session Info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plyr_1.8.7        jpeg_0.1-9        data.table_1.14.2 raster_3.5-15    
## [5] sp_1.4-7          SSDM_0.2.8       
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.2        sass_0.4.1           maps_3.4.0          
##  [4] viridisLite_0.4.0    jsonlite_1.8.0       splines_4.1.2       
##  [7] dotCall64_1.0-1      bslib_0.3.1          Formula_1.2-4       
## [10] shiny_1.7.1          highr_0.9            yaml_2.3.5          
## [13] pillar_1.7.0         lattice_0.20-45      glue_1.6.2          
## [16] spThin_0.2.0         digest_0.6.29        promises_1.2.0.1    
## [19] randomForest_4.7-1.1 colorspace_2.0-3     gbm_2.1.8           
## [22] htmltools_0.5.2      httpuv_1.6.5         Matrix_1.3-4        
## [25] pkgconfig_2.0.3      earth_5.3.1          bookdown_0.26       
## [28] purrr_0.3.4          xtable_1.8-4         scales_1.2.0        
## [31] terra_1.5-21         later_1.3.0          TeachingDemos_2.12  
## [34] tibble_3.1.6         proxy_0.4-26         mgcv_1.8-38         
## [37] generics_0.1.2       ggplot2_3.3.6        ellipsis_0.3.2      
## [40] nnet_7.3-16          cli_3.3.0            survival_3.2-13     
## [43] magrittr_2.0.3       crayon_1.5.1         mime_0.12           
## [46] evaluate_0.15        fansi_1.0.3          nlme_3.1-153        
## [49] class_7.3-19         shinydashboard_0.7.2 tools_4.1.2         
## [52] dismo_1.3-5          lifecycle_1.0.1      stringr_1.4.0       
## [55] munsell_0.5.0        plotrix_3.8-2        compiler_4.1.2      
## [58] jquerylib_0.1.4      e1071_1.7-9          rlang_1.0.2         
## [61] plotmo_3.6.2         grid_4.1.2           rstudioapi_0.13     
## [64] spam_2.8-0           rmarkdown_2.13       gtable_0.3.0        
## [67] codetools_0.2-18     reshape2_1.4.4       R6_2.5.1            
## [70] gridExtra_2.3        rgdal_1.5-32         knitr_1.38          
## [73] dplyr_1.0.9          fastmap_1.1.0        utf8_1.2.2          
## [76] poibin_1.5           stringi_1.7.6        parallel_4.1.2      
## [79] rmdformats_1.0.3     Rcpp_1.0.8.3         fields_13.3         
## [82] vctrs_0.4.1          rpart_4.1-15         tidyselect_1.1.2    
## [85] xfun_0.30