INTRODUCTION

1: Picture species 98 Phalanta phalantha (Sani 2009)

This tutorial shows how to use the model spatialMaxent in R to create a Species Richness Map of the butterflies of Pakistan. We start with point information, the presence points of each species, and want to derive area information, the Species Richness Map. The presence points lie in geographical space, i.e. the locations where the species were observed (Pearson 2014). The model links this point information with environmental information in niche space and predicts, for every location, how likely it is to find a species there. The result therefore tells you where you can expect a species, not where it actually occurs, which is valuable when searching for the species in the field.

To begin with, we have presence-only data for all 429 butterfly species in Pakistan as well as environmental information. To run the data through the spatialMaxent model we have to prepare it in a certain way, which is described in the code below. We need:

  • The environmental layers, which are background points linked with the environmental information

  • The samples layers, which are the coordinates of each species in a separate file, also linked with the environmental information

  • The projection layers, which are one file per environmental variable

THEORY OF THE MODEL

To understand the model spatialMaxent it is useful to get to know the model MaxEnt first, since spatialMaxent is an extension of MaxEnt.

The model MaxEnt

MaxEnt, or maximum entropy, is a machine learning method that helps to make predictions or draw conclusions from incomplete information (Phillips et al. 2006, p. 234ff.). It originates from statistical mechanics and has been applied in fields such as astronomy, image reconstruction, and statistical physics (ibid.).

What is maximum Entropy?

Entropy is a scientific concept that is usually associated with a state of disorder, randomness, or uncertainty (Wikipedia 2023). Entropy is at its maximum when all outcomes are equally likely to occur, i.e. when the situation is most uncertain (Bothe 2021).

2: Coin flipping (own representation)

A simple example to explain maximum entropy is flipping a coin. When a fair coin is flipped, the two possible outcomes, heads or tails, are equally likely. In this case there is maximum entropy, because no prediction of the outcome is possible. There is no additional information that could tell us which result to expect.
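The coin example can be put into a few lines of base R. This is a small illustrative sketch (not part of the tutorial's pipeline): `entropy()` computes the Shannon entropy in bits, which is maximal for the fair coin and lower for a biased one.

```r
# Shannon entropy (in bits) of a discrete probability distribution
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

entropy(c(0.5, 0.5))  # fair coin: maximum entropy of 1 bit
entropy(c(0.9, 0.1))  # biased coin: lower entropy, outcome more predictable
```

The more unequal the probabilities, the lower the entropy and the easier it is to guess the outcome.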

What is MaxEnt for SDM?

To use the concept of maximum entropy in Species Distribution Modeling, the idea is to find the distribution that has the highest possible “spread” or “uniformity” while still meeting certain constraints (Phillips et al. 2006, p. 234ff.). Constraints are usually environmental conditions at the locations where the species occurs, such as precipitation, temperature, elevation, and so on (ibid.). Using these constraints, MaxEnt calculates a probability distribution over the pixels of the grid covering the study area, giving the likelihood of species presence for each pixel (ibid.). The default output that we use in this tutorial is a logistic output, which gives an estimate between 0 and 1 of the probability of presence, as you can see in the picture below (Phillips 2017, p. 114f.)

3: Spatial prediction for Phalanta phalantha (own representation)

Advantages and limitations of MaxEnt

Advantages of MaxEnt:
  • It can work with presence-only data, which is useful when absence data is missing (Phillips et al. 2006, p. 234ff.)
  • The results are easy to analyze: you can see the relationship between environmental factors and species suitability (ibid.)
  • It tries to avoid overfitting the model and provides detailed outputs (ibid.)
  • It can handle both large and small sets of occurrence data (Phillips et al. 2008, p. 161f.)
  • It provides default settings suitable for various species and environmental conditions (ibid.)
Limitations of MaxEnt:
  • It is not as widely used as other statistical methods (GLM or GAM), so there are fewer guidelines and fewer ways to estimate prediction errors (Phillips et al. 2006, p. 234ff.)
  • It is not inherently bounded and can give very high predictions outside the range of the study area (be careful when applying it to new or future environments) (ibid.)
  • It needs the external software Java to work (ibid.)
  • Tuning the parameters can be time-consuming, especially for a large number of species (Phillips et al. 2008, p. 161f.)

What is new in spatialMaxent?

The extension spatialMaxent adds new features to the model. As a validation method it uses spatial cross validation, where the sample points are grouped into blocks based on their spatial locations (Meyer et al. 2019). In each cross validation round, one block is held back as the test data, and the models are trained on the data from the remaining blocks (ibid.).

For spatial mapping, random cross validation does not yield predictions as good as spatial cross validation, since the spatial dependencies are not taken into account (ibid.).

Here you can see the difference between random cross validation and spatial cross validation:

4: Difference between random and spatial cross validation (Meyer et al. 2019)

To understand what exactly the cross validation does, we can look at the illustration below, which shows the spatial block distribution for our data. Before the cross validation, the model groups the points into blocks, so that points with spatial proximity (a cluster) end up in the same block. As you can see below, blocks only appear where at least one presence point is located. Each block should contain as many presence points as possible, but all blocks have the same size. You then have a certain number of blocks and a number of folds, which is the number of rounds of the cross validation. With a fold number of 4, each block is assigned a number between 1 and 4. The assignment of these numbers can be random or systematic, but each number appears the same number of times. We need these numbers for the cross validation and to create independent test data, because one of these numbers is reserved for the test points.

5: Spatial blocks with presence points
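The fold assignment described above can be sketched in a few lines of base R. This is a toy illustration, not blockCV itself; the number of blocks is assumed for the example.

```r
# Toy sketch: assign fold numbers 1..k evenly and in random order to blocks,
# mimicking the idea that each fold number appears the same number of times
k <- 4
n_blocks <- 12                              # assumed number of spatial blocks
set.seed(1)                                 # for reproducibility
fold_ids <- sample(rep(1:k, length.out = n_blocks))
table(fold_ids)                             # each fold number appears 3 times
```

In the real workflow this assignment is done by blockCV, which additionally takes the spatial layout of the blocks into account.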

The extension includes three tuning processes that help to find the best model and prevent overfitting:

  • Forward variable selection (FVS): the model is trained with all possible 2-variable combinations and the best one is selected; this combination is then trained again with each of the remaining variables separately and the best model is selected. This step is repeated until the model cannot be improved any further; from then on, only the variables selected by the FVS are used for all further models (Meyer et al. 2019)

  • Forward feature selection (FFS): it follows a similar concept but focuses on selecting the best feature types (linear, hinge, …); it selects the model with the best feature and keeps adding features until there is no further improvement (ibid.)

  • Regularization multiplier tuning: models are computed with different regularization multipliers and the one that performs best is selected (ibid.)

    -> The final model uses the variables, features, and multiplier from the three tuning processes (ibid.)
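The FVS logic can be sketched schematically. In the sketch below, `score_model` is a hypothetical stand-in for training spatialMaxent and returning a cross validation score; only the selection loop itself reflects the procedure described above.

```r
# Schematic sketch of forward variable selection (FVS)
score_model <- function(vars) length(vars) - 0.1 * length(vars)^2  # dummy score

vars_all <- paste0("bio", 1:5)

# step 1: train all 2-variable combinations and keep the best pair
pairs <- combn(vars_all, 2, simplify = FALSE)
selected <- pairs[[which.max(sapply(pairs, score_model))]]

# step 2: keep adding the best remaining variable until no improvement
repeat {
  remaining <- setdiff(vars_all, selected)
  if (length(remaining) == 0) break
  scores <- sapply(remaining, function(v) score_model(c(selected, v)))
  if (max(scores) <= score_model(selected)) break  # no improvement: stop
  selected <- c(selected, remaining[which.max(scores)])
}
selected
```

FFS follows the same loop structure, but over feature types instead of variables.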

IMPLEMENTATION

Data preparation

Before we can run the model spatialMaxent, we need to prepare the data that the model works with. For our code we prepared the folder structure as follows:

In the following script we work with relative paths; you can also work with absolute paths. In that case you need to change the paths and add your full path to the code.

Set your working environment

You can also set a working directory with the setwd() function. We can skip this step because we use paths relative to our R project. Then we need to check that we have installed and loaded all the packages we need. If you have not installed the packages, you can do so with the install.packages() function.
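As a convenience, you can check in one step which of the required packages are still missing. This is an optional sketch; if the result is non-empty, run install.packages(missing).

```r
# list which of the required packages are not yet installed
pkgs <- c("predicts", "caret", "sf", "geodata", "dismo", "dplyr", "terra",
          "raster", "blockCV", "RColorBrewer", "DT", "mapview")
missing <- setdiff(pkgs, rownames(installed.packages()))
missing  # if non-empty, run install.packages(missing)
```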

The packages we need for this tutorial:

  • predicts: implements functions for spatial predictions methods

  • caret: includes functions that attempt to streamline the process for creating predictive models (data splitting, pre-processing, feature selection…)

  • sf: is used for handling and manipulating spatial data in a simple features format

  • geodata: for downloading of geographic data for use in spatial analysis and mapping (climate, crops, elevation, land use…)

  • dismo: implements some species distribution models (maxent, bioclim, domain..) and has functions for sampling background points, k-fold sampling and AUC

  • dplyr: tool for working with data frame like objects, both in memory and out of memory

  • terra: provides methods to manipulate geographic (spatial) data in “raster” and “vector” form (R Documentation)

  • raster: includes methods to create a RasterLayer object (R Documentation)

  • blockCV: includes functions for generating train and test folds for k-fold and leave-one-out (LOO) cross-validation (CV)

  • RColorBrewer: creates nice looking color palettes especially for thematic maps (R Documentation)

  • DT: provides an R interface to the JavaScript library DataTables

  • mapview: produces an interactive view of the specified spatial object(s) on top of the specified base map

# load packages that are needed
library(predicts)
library(caret)
library(sf)
library(geodata)
library(dismo)
library(dplyr)
library(terra)
library(raster)
library(blockCV)
library(RColorBrewer)
library(DT)
library(mapview)

Get your presence data

To get our presence data we download the data set “PakistanLadakh.csv” from ILIAS into our folder Data and load the .csv data as a dataframe into R. With colnames() we assign the column names “species”, “x” and “y” to the columns.

# load the .csv data as a dataframe into R
butterflies <- read.csv("Data/PakistanLadakh.csv")
colnames(butterflies)<- c("species","x","y")
# check how your data looks like
str(butterflies)
## 'data.frame':    5870 obs. of  3 variables:
##  $ species: chr  "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" ...
##  $ x      : num  73.1 73.4 73.4 73.8 74.2 ...
##  $ y      : num  34.4 34.2 34.1 33.8 33.9 ...

The function str() gives information about the structure of the data. The output tells us there are 5870 observations (rows) and 3 variables (columns): species with the name of the butterfly, and x and y with the coordinates.
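A quick sanity check at this point is to count the distinct species. The idiom is shown on a toy data frame below; applied to the real data, length(unique(butterflies$species)) should report the 429 species mentioned above.

```r
# count distinct species, demonstrated on a toy data frame
toy <- data.frame(species = c("A", "A", "B"), x = 1:3, y = 1:3)
length(unique(toy$species))  # 2 distinct species in the toy data
# length(unique(butterflies$species))  # run this on the real data
```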

Filter data

With the print() function we can check how many entries exist of one species.

# Check how many entries exist of one species
print(head(table(butterflies$species), 20))
## 
##       Acraea_issoria        Acraea_violae      Actinor_radians 
##                    5                    3                    3 
##     Acytolepis_puspa Aeromachus_stigmatus Aglais_caschmirensis 
##                    4                    9                   53 
##    Aglais_ladakensis        Aglais_rizana     Anaphaeis_aurota 
##                    3                   21                   52 
##    Apolria_nabellica       Aporia_belucha       Aporia_soracta 
##                   12                   33                   14 
##        Appias_albina      Appias_libythea      Argynnis_aglaja 
##                    1                    3                   33 
##   Argynnis_childreni   Argynnis_jainadeva      Argynnis_kamala 
##                   16                   31                   19 
##       Argynnis_niobe     Argynnis_pandora 
##                    1                    5

6: Overview of the number of presence points

This illustration shows how often each number of presence points occurs. Species with few presence points are much more common than species with many presence points. We set the cut-off at 15 presence points, even though this excludes a lot of species. The reason is that our model does not work well with fewer than 15 presence points. At the same time we do not want to exclude too many butterfly species, because we want to create a Species Richness Map of the butterflies in Pakistan. With more than 15 presence points, only 138 butterfly species remain. If we chose a higher cut-off, the result would only be a Species Richness Map of a few butterfly species in Pakistan. We can check whether the chosen number of presence points is reasonable when we create the test and train data; if it is not, we come back to this step later and change the number. To filter all species with more than 15 entries, we count how often each species name appears in the data. For that we first group the data by the column “species” with group_by() and use count() to count how often each group appears. We need this count to filter the data by the number of presence points per species.

# filter all butterflies that have more than 15 entries
count_butterflies <- butterflies %>% group_by(species) %>% count()

butterflies <- count_butterflies  %>%  filter(n>15)   %>%  
  inner_join(butterflies, by= "species") %>%
  select(species, x,y)

Environmental Layer

The next step is to create the environmental layer for the spatialMaxent model. For that we load the climate data for Pakistan (country code “PAK”) from the package “geodata” and store it in an object named “R”. The function worldclim_country() is used to retrieve the data.

- “res” specifies the spatial resolution of the data, here it is 10.
- “var” specifies the variable, in this case “bio” for bioclimatic variables.
- “path” specifies the path where the data will be stored.

biolayer_names <- c("Annual Mean Temperature in °C", 
                    "Mean Diurnal Range", 
                    "Isothermality", 
                    "Temperature Seasonality", 
                    "Max Temperature of Warmest Month °C", 
                    "Min Temperature of Coldest Month °C", 
                    "Temperature Annual Range °C", 
                    "Mean Temperature of Wettest Quarter °C", 
                    "Mean Temperature of Driest Quarter °C", 
                    "Mean Temperature of Warmest Quarter °C", 
                    "Mean Temperature of Coldest Quarter °C", 
                    "Annual Precipitation mm", 
                    "Precipitation of Wettest Month in mm", 
                    "Precipitation of Driest Month in mm", 
                    "Precipitation Seasonality", 
                    "Precipitation of Wettest Quarter in mm", 
                    "Precipitation of Driest Quarter in mm", 
                    "Precipitation of Warmest Quarter in mm", 
                    "Precipitation of Coldest Quarter in mm")

# get environmental layers and set the name
R <- geodata::worldclim_country(country = "PAK", res = 10, var = "bio", path = "Data/", 
    version = "2.1")

# masking the bio variables to the border of Pakistan
R <- terra::mask(R, geodata::gadm(country = "Pakistan", path = "Data/"))

filename0 <- "Data/border.gpkg"
borders <- sf::read_sf(filename0)
borders$GID_0 <- NULL

plot(borders)


# Plotting
for (j in 1:length(biolayer_names)) {
  plot(R[[j]], main = biolayer_names[j])
}

Biolayer

Create background points

The randomPoints() function from the dismo package is used to generate random points. The mask argument defines the range within which the points should be generated, here Pakistan. The argument n specifies how many random points will be generated, here 10 000. st_as_sf() is used to recognize the points as spatial coordinates. We create one file of background points for the model instead of one for each species.

# create background points
bg=sf::st_as_sf(as.data.frame(dismo::randomPoints(mask=raster::stack(R), n=10000)), 
                crs=terra::crs(R), coords=c("x","y"), remove=F) 
# the number of 10,000 background points isn't fixed and can be reduced for RAM reasons

Now we extract the environmental information for our background points with extract() and bind them together with cbind().

The spatialMaxent model only accepts the background points if the data set contains nothing but the coordinates and the environmental information, so we need to remove all other entries. Before doing that, we save the background points as a gpkg file, because we will need them later with their geometry information. For the csv file, which the model needs, we delete the columns “ID” and “geometry”. Since this data set has to look exactly the same as the samples layers, we create the columns “species” and “spatialBlock” and put the columns in the same order as in the samples layers. It does not matter what we write in these newly generated columns, because we only need them to keep the bio layers in the right order. Otherwise the model would associate the wrong bio layers with each other.

Then we save the background points as csv file.

# extract environmental information for background data
bg_extr=terra::extract(R, bg)
bg=cbind(bg,bg_extr);rm(bg_extr)

# adapt background information 
bg$ID <- NULL #remove column ID

filename1 <- paste0("Data/samples/background_",".gpkg")
    sf::write_sf(bg, filename1) #save background points with geometry
    
bg$geometry <- NULL #remove column geometry
bg$species <- "background"
bg$spatialBlock <- "1"
bg = bg %>% dplyr::select(c("species","x","y","spatialBlock"), everything())


# create file path and file for background points and save as csv file
    filename2 <- paste0 ("Data/samples/background",".csv")
    write.csv(bg, filename2, quote = F)

Now the environmental layers should look like this:

background <- read.csv("Data/samples/background.csv")
datatable(head(background, 10), options = list(scrollX = TRUE, scrollY = "500px"))

Projection layer

The projection layers are one raster file per bio layer, which spatialMaxent needs to create a map. We use the function nlyr() to find out how many layers R has, use a loop to extract the names of the bio variables with names(), create one layer per variable, and save each as an asc file with writeRaster().
Alternatives to the asc format are mxe, grd, or bil files. It does not work with tif files.

# create layers for each bio variable
num_variables <- nlyr(R)

for (i in 1:num_variables) {
  variable_name <- names(R)[i] #extract names of bio variable
  variable_layer <- R[[i]] #create layer for bio variable
  layer_file <- paste0("Data/raster/", variable_name, ".asc") #save layer
  writeRaster(variable_layer, layer_file, overwrite = TRUE)
}

Samples

In this chapter we create the train and test data for our spatialMaxent model. We need the test data later to check whether our model performed well, but we will not put it into the model. Therefore, only the train data needs to be prepared in a certain way. The model only accepts a data frame whose first column is the name of the species, whose second and third columns are the coordinates, whose fourth column is the number of the spatial block, and whose remaining columns are the environmental information.

Split data

For our model spatialMaxent we need to split our data into a separate dataframe for each species. For that we use split() on the column “species”. In the next step we create a geopackage file for each species: we loop over the butterflies_split data and write one file per species. To set up the file path we use the function file.path(). The code in the brackets after file.path() must be the exact path we want to create for the newly generated file. We want the same name for each species, except for the species number at the end, which is taken from the loop index “ii”.

# create separate dataframes for each species
butterflies_split <- split(butterflies, butterflies$species)

# We are transforming the dataframe into an sf object
butterflies=sf::st_as_sf(butterflies, coords=c("x", "y"), remove=F, 
                         crs=sf::st_crs("epsg:4326"))

# create gpkg file
for (ii in 1:length(butterflies_split)) {
   sf::write_sf(butterflies_split[[ii]], file.path("Data/split_species/",
                                         paste0("butterflies_species_",ii,".gpkg"))) 
}

Now we have a geopackage file for each species in the path Data/split_species, with the name butterflies_species_ followed by the species number. As an example, this is how the file name and path should look for the species 98 Phalanta phalantha: “Data/split_species/butterflies_species_98.gpkg”
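To double-check, you can rebuild the expected file name for one species in the same way the loop does and test whether the file was actually written. This is a small optional sketch using species 98 as the example.

```r
# rebuild the expected file name for species 98, as the loop constructs it
ii <- 98
fn <- file.path("Data/split_species", paste0("butterflies_species_", ii, ".gpkg"))
fn
# file.exists(fn)  # should be TRUE after the loop above has run
```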

Create train and test data

We need a loop to create train and test data for each species and to prepare the cross validation. The loop length is the number of species, because we want to create test and train data for each species.

In the first step of the loop, we read the previously generated split species files into our script.

As described before, we also need the environmental information in our train data, so we extract it and attach it to the points.

We use the package blockCV and the function cv_spatial() to create the spatial blocks that we will need for the spatial cross validation. You need to check with your data which k, block size, and number of iterations are reasonable. We do not have many presence points per species, so the creation of the blocks does not work with a k above 4 or a block size above 50,000 meters. With a bigger k and a bigger block size the model produces better results, but also takes longer to run. You can try the block creation with different values and check whether it works and the results look good.

After creating the blocks we can save the result, i.e. which fold each point falls into, in an extra column. This information is stored in the folds_ids element of the object created before.

In the next step we separate the data into training and test data. We have 4 folds, so we use one fold (the first) as test data and 3 as training data. If you have more folds, you can set aside more test data, but our value of 25% test data is already relatively high. Bear in mind that less training data leads to more inaccurate results, but you also need enough test data to check your results.

After this step we can save the test and train data as gpkg.

We need the train data as csv for our model, so we save it as a dataframe and remove the geometry and ID columns so that the model works.

# set folds for cross validation
k=4 

for( iii in 1:length(butterflies_split)){
  
  # filepath where we saved the generated split files of each species before
  filename <- paste0("Data/split_species/butterflies_species_",iii,".gpkg") 
  butterflies_data_tmp <- sf::read_sf(filename) #read the before generated split files
  
  # convert object into sf object and change the coordinate system 
  butterflies_data=sf::st_as_sf(butterflies_data_tmp, coords=c("x", "y"), remove=F, 
                                crs=sf::st_crs("epsg:4326")) 
  
  # add environmental information to butterflies data
  # extract samples and environmental information 
  butterflies_extr=terra::extract(R,butterflies_data) 
  
  # connect samples and environmental information
  butterflies_data=cbind(butterflies_data, butterflies_extr); rm(butterflies_extr) 
   
  df_4326 <- sf::st_transform(butterflies_data, "epsg:4326") #change coordinate system 
  
  
  # create spatial blocks 
  
  sb <- blockCV::cv_spatial(x = df_4326, 
                   # column where the name of the species is stored
                            column  = "species", 
                   # number of folds for cross validation, set at begin of the loop
                            k = k, 
                   # range of the blocks
                            size = 50000, 
                   # selection option 
                            selection = "random", 
                   # how often the folds are being created and compared
                            iteration=200, 
                   # create plots
                              plot=T 
                              
  )
  
  plot(borders,add=T)
  
  
 # we create a new column with the number of the fold (1-4) in which the species occur
  butterflies_data$spatialBlock <- sb$folds_ids 
  butterflies_data=na.omit(butterflies_data) #remove empty fields


  fold =1 
   
    `%not_in%`<- Negate(`%in%`)

 
    # create train and test data
    
    # choose butterflies data which is not in before generated vector fold as train data
    trainData = butterflies_data%>%dplyr::filter(spatialBlock  %not_in% fold) 
    
    # as test data the species which are in the vector fold
    testData = butterflies_data%>%dplyr::filter(spatialBlock %in% fold) 
    
    # create file path and gpkg file for test data
    filename3 <- paste0("Data/samples/butterflies_testData_",iii,".gpkg")
    sf::write_sf(testData, filename3)
    
    # create file path and gpkg file for train data
    filename4 <- paste0("Data/samples/butterflies_trainData_",iii,".gpkg")
    sf::write_sf(trainData, filename4)
    
    # preparation for csv train file
    trainData = as.data.frame(trainData) #transform to data frame
    trainData$geometry<-NULL #remove geometry information
    trainData$ID<-NULL #remove ID column
    
    # adapt order of the dataframe 
    trainData=trainData %>% dplyr::select(c("species","x","y","spatialBlock"), 
      everything())
    
    # create a file path and file for csv train data
    filename5 <- paste0("Data/samples/butterflies_trainData_",iii,".csv")
    write.csv(trainData, filename5, quote = F, row.names = F)
}

Here you can see the spatial blocks that were created for the species Phalanta phalantha.

7: SpatialBlocks Species 98 Phalanta phalantha

The training data for the species 98 Phalanta phalantha should look like this:

traindata_species98 <- read.csv("Data/samples/butterflies_trainData_98.csv")
datatable(traindata_species98, options = list(scrollX = TRUE, scrollY = "500px"))

Check results

So far we have 3 important variables we can modify: the filter of the data at more than 15 presence points, the number of folds, and the size of the blocks. The aim is the best result for the model, but the model also has to run. With more presence points we could set the number of folds higher, but we want to include as many butterfly species as possible. Either way we must compromise, try different options, and check the results. To check the results we look at the data in QGIS, as depicted in the illustration below.

8: Training, test, and background data for species 98 Phalanta phalantha

You should check whether the train and test data have reasonable sizes.
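The split proportions can be verified with a toy calculation: with k = 4 folds and fold 1 held out, three quarters of the points remain for training. The fold ids below are made up for illustration; on the real data you would read the gpkg files written above and compare their row counts in the same way.

```r
# toy illustration of the 75/25 split logic with k = 4 folds
spatialBlock <- rep(1:4, each = 5)   # toy fold ids for 20 points
fold <- 1
train_n <- sum(!spatialBlock %in% fold)
test_n  <- sum(spatialBlock %in% fold)
c(train = train_n, test = test_n)    # 15 train vs 5 test points
```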

After trying different options and choosing the best one, we can finally run the model.

RUN THE MODEL

To run the model you have to install spatialMaxent: Download spatialMaxent

You also need Java: Download Java for Windows

For one species

If you only want to run the model with one species, you can open the model with the batch file on Windows or the JAR file on macOS. A window like the one below should appear:

9: spatialMaxent console

You can load the created environmental layers, the projection layers, and the samples into the model. Then you have to choose an output directory. After that you can change some settings according to your data and aim (we used the default settings) and click Run.

For all species

It is more complicated to run all species through the model, because you can only load one species at a time, not all at once. Thus, it is easier to create a loop and call the model from R.

For that you first need to set the path where the spatialMaxent JAR is saved. Then you load the previously created samples as a list and set the loop length to the length of this list. We want a separate folder for each output file and create it with the function dir.create() in every loop iteration, i.e. for every species. We use the function system() to call the spatialMaxent model from R. We have to set several options and the paths where our environmental layers, projection layers, and samples are saved, and where we want the output directory. This step can take hours to days. To speed up the process a little, you can set the number of threads higher.

# set the path where spatialMaxent is saved
jarPath= "java -jar C:/Users/baumj/Documents/Dokumente/Uni/4Semester/
  speciesdistribution/spartialMaxent/spatialMaxent_jar/spatialMaxent.jar"

# read the data
speciesList <- list.files(file.path("Data/samples"), full.names = F, recursive = F, 
  pattern=".csv")
speciesList=speciesList[-1] #excludes the first entry since it is our background file

# read the data into spartialMaxent and define the settings
for (iiii in 1:length(speciesList)) {
      
 dir.create(paste0(getwd(), "/Output/butterflies_trainData_",iiii), recursive = T)
  
    system(paste0(jarPath,
      " threads=4",
      " outputdirectory=", getwd(), "/Output/butterflies_trainData_", iiii,
      " samplesfile=", getwd(), "/Data/samples/butterflies_trainData_", iiii, ".csv",
      " environmentallayers=", getwd(), "/Data/samples/background.csv",
      " warnings=false",
      " projectionLayers=", getwd(), "/Data/raster/",
      " outputGrids=false writeMESS=false writeClampGrid=false askoverwrite=false autorun"))
} 

RESULTS

One result of the spatialMaxent model is a distribution map for each species. You can see the Species Distribution Map of Phalanta phalantha below. We added the test and train data and changed the colors with QGIS. The illustration shows the probability that the species occurs in each area. The presence points lie, as they should in most cases, in the red and yellow areas, where the species occurs with a probability above 50%. However, some points lie in areas where our model predicts a low probability of occurrence.

10: distribution 98 Phalanta phalantha

To get our Species Richness Map we stack all prediction maps from the spatialMaxent results with stack(). We need to decide at which value of the predicted map we say: here a species occurs (1) and here no species occurs (0). We decided on a threshold of 0.5.
This step can take some time (in our case 3 hours).

The threshold of 0.5 we used is commonly used for generating Species Richness Maps (Jimenez-Valverde & Lobo 2007, p. 362). Different threshold criteria have been proposed in the literature to determine which threshold is most suitable, and the choice has a high influence on the accuracy of the map (ibid.). If you want to make sure you use the right threshold for the goal of your map, you can read the paper by Jimenez-Valverde & Lobo (2007).

# stack all raster files into one

files <- list.files(path = "Output", pattern = ".+.raster.asc$", full.names = T, 
  recursive = T)
rstack <-  stack(files)
 
# set values over 0.5 to 1 and values of 0.5 or below to 0
rstack[rstack > 0.5] <- 1
rstack[rstack <= 0.5] <- 0


# display RColor Brewer palettes
display.brewer.all()

# plot rstack with the values 1 and 0 for the species 98
cols2 <- brewer.pal(n=11, name="Spectral")
Spectral <- colorRampPalette(cols2)  
plot(rstack[[98]],col=rev(Spectral(100)),
     # axis label
     xlab = "longitude", ylab = "latitude", legend = TRUE, 
     # add legend
     legend.args = list(text = "species richness", side = 4, font = 2, line = 2.5, 
        cex = 1))

In this illustration you can see that the yellow and red area from before with a probability of occurrence over 50% turns into the red one, where we say the species occurs. At the blue area the species does not occur.

11: rstack species 98

Finally, we sum the raster stack with sum() to get our Species Richness Map.

# sum of rstack
rstack_sum <- sum(rstack)


# path for saving the GeoPackage
filepath <- "rstack_sum.gpkg"

# save the raster object in GeoPackage format
writeRaster(rstack_sum, filepath, format = "GPKG")


# plot the species richness map for all species
plot(rstack_sum, col=rev(Spectral(100)),
     xlab = "longitude", ylab = "latitude", legend = TRUE, 
     legend.args = list(text = "species richness", side = 4, font = 2, line = 2.5, 
       cex = 1))

The Species Richness Map

speciesRichness <- raster("rstack_sum.gpkg")

mapview(speciesRichness, zcol="species richness", col.regions= rev(RColorBrewer::brewer.pal(10, "Spectral")))
## Warning in rasterCheckSize(x, maxpixels = maxpixels): maximum number of pixels for Raster* viewing is 5e+05 ; 
## the supplied Raster* has 3528000 
##  ... decreasing Raster* resolution to 5e+05 pixels
##  to view full resolution set 'maxpixels =  3528000 '

EVALUATION

Now we have created a species richness map, but how do we know that our model works well for predicting where a species occurs? For that we use three evaluation tools: the Boyce Index, the AUC and the MAE.

The Boyce Index indicates how much the model predictions differ from a random prediction (R help for ecospat.boyce; package ecospat version 3.5.1). The index can take values between -1 and +1. Positive values show that the prediction of the model is consistent with the distribution of presences in the evaluation data set, while negative values show that presences are more common in areas the model rates as poor quality (ibid.). Values close to zero indicate that the model prediction is no better than a random prediction (ibid.).
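The idea behind the Boyce Index can be sketched in a few lines of base R. This is a conceptual toy, not the full ecospat.boyce() implementation, and all frequencies here are made up: presences and background points are binned into suitability classes, and the Spearman correlation between the predicted-to-expected (P/E) ratio and suitability is computed.

```r
# Toy values: share of presence / background points per suitability class
suitability_class <- c(0.1, 0.3, 0.5, 0.7, 0.9)      # class midpoints
presence_freq     <- c(0.02, 0.05, 0.15, 0.30, 0.48) # observed ("predicted") share
background_freq   <- c(0.30, 0.25, 0.20, 0.15, 0.10) # share expected by chance

# P/E ratio per class; for a good model, P/E rises with suitability
pe_ratio <- presence_freq / background_freq

# Boyce-like score: Spearman correlation of P/E with suitability
boyce_like <- cor(suitability_class, pe_ratio, method = "spearman")
boyce_like  # 1 here, since P/E rises monotonically with suitability
```

In our toy data the P/E ratio increases strictly with suitability, so the score is the maximum of +1; a model in which presences pile up in low-suitability classes would push the score towards -1.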

The area under the ROC curve (AUC) can take a value between 0 and 1 (Marburg University 2023). A value of 0.5 indicates that the model is random, while a higher value indicates a model that is better than random (ibid.).
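The AUC has a useful probabilistic reading: it is the probability that a randomly chosen presence receives a higher prediction than a randomly chosen background point. A minimal base-R sketch with made-up values:

```r
# Made-up evaluation data: 1 = presence, 0 = background
observed  <- c(1, 1, 1, 0, 0, 0)
predicted <- c(0.9, 0.8, 0.4, 0.5, 0.3, 0.2)

pos <- predicted[observed == 1]
neg <- predicted[observed == 0]

# Compare every presence prediction with every background prediction;
# ties count as half a "win"
wins <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
auc  <- mean(wins)
auc  # 8/9: eight of the nine pairs are ranked correctly
```

Metrics::auc(), used in the evaluation loop below, computes the same quantity from the observed/predicted columns.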

To calculate the mean absolute error (MAE), the absolute error values are summed up and the sum is divided by the sample size (Willmott et al. 2005, p. 80).
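Written out in base R with made-up values, the MAE is a one-liner:

```r
# Made-up evaluation data: 1 = presence, 0 = background
observed  <- c(1, 1, 0, 0)
predicted <- c(0.8, 0.6, 0.3, 0.1)

# MAE: sum of absolute errors divided by the sample size
mae <- mean(abs(observed - predicted))
mae  # (0.2 + 0.4 + 0.3 + 0.1) / 4 = 0.25
```

Metrics::mae(), used in the evaluation loop below, implements exactly this formula.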

Before we start with the evaluation, we have to rename the output folders of the spatialMaxent program so that the evaluation results can be matched to the right species. This is necessary because R sorts the species differently than how they are stored in our files: without zero-padding, for example, folder 10 sorts before folder 2.

# loop for all folders
for (folder_number in 1:138) {
# Format folder number according to range of values
  if (folder_number < 10) { #for all numbers from 1 to 9 add "00"
    formatted_number <- sprintf("00%d", folder_number)
  } else if (folder_number < 100) { #for all numbers from 10 to 99 add "0"
    formatted_number <- sprintf("0%d", folder_number)
  } else {
    formatted_number <- as.character(folder_number)
  }
  
# Using the formatted number to create the old and new folder path
# it is important to use the absolute path
  old_folder_path <- paste0("C:/Users/baumj/Documents/Dokumente/Uni/4Semester/
    speciesdistribution/spartialMaxent/Output/butterflies_trainData_", 
                            folder_number)
  new_folder_path <- paste0("C:/Users/baumj/Documents/Dokumente/Uni/4Semester/
      speciesdistribution/spartialMaxent/Output/butterflies_trainData_", 
                            formatted_number)
  
# Rename the folder unless the target already exists
# (renaming onto an existing folder would fail)
  if (!dir.exists(new_folder_path)) {
    file.rename(old_folder_path, new_folder_path)
  }
}

Now the folder structure should have changed from this:

into this:

We have to do the same with the files for the test data. The difference from the previous step is that here we rename the files themselves rather than the folders.

# loop for all files
for (file_number in 1:138) {
 # Format files number according to range of values
  if (file_number < 10) { #for all numbers from 1 to 9 add "00"
    formatted_file_number <- sprintf("00%d", file_number)
  } else if (file_number < 100) { #for all numbers from 10 to 99 add "0"
    formatted_file_number <- sprintf("0%d", file_number)
  } else {
    formatted_file_number <- as.character(file_number)
  }
  
# Using the formatted file number to create the old and new file path
  old_file_path <- paste0("C:/Users/baumj/Documents/Dokumente/Uni/4Semester/
    speciesdistribution/spartialMaxent/Data/samples/butterflies_testData_", 
                          file_number, ".gpkg")
  new_file_path <- paste0("C:/Users/baumj/Documents/Dokumente/Uni/4Semester/
    speciesdistribution/spartialMaxent/Data/samples/butterflies_testData_", 
                          formatted_file_number, ".gpkg")
  
# Rename the file unless the target already exists
# (renaming onto an existing file would fail on some systems)
  if (!file.exists(new_file_path)) {
    file.rename(old_file_path, new_file_path)
  }
}

This shows how the file was named before:

And this after the renaming:
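The zero-padding done step by step in the two loops above can also be expressed in a single vectorized call to sprintf(), which pads numbers with leading zeros to a fixed width. The rename itself is only sketched as a comment here, because the paths (the hypothetical `base_path`) depend on your setup.

```r
numbers <- 1:138
padded  <- sprintf("%03d", numbers)  # "001" "002" ... "138"

# The folders or files could then be renamed in one vectorized step, e.g.:
# file.rename(paste0(base_path, numbers), paste0(base_path, padded))
```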

The next step in evaluating the model is to read the background points and create a table with one row per species, into which we can enter the value of each evaluation tool.

#load background points from the files
bg_eval <- st_read("Data/samples/background_.gpkg")

# create a table with every species and their count of entries
# create an empty column for each evaluation tool
butterflies_boyce <- count_butterflies
butterflies_boyce$core <- NA #empty column for the boyce index
butterflies_boyce$AUC <- NA #empty column for the AUC
butterflies_boyce$MAE <- NA #empty column for the MAE

Next, for each species we read the test data and the raster asc file from the spatialMaxent output into R. The Boyce Index is then calculated with the function ecospat.boyce() and entered into the previously created table, followed by the MAE and AUC.

To calculate the MAE and AUC, we extract the values of the prediction raster at the test points with the function extract() and build a dataframe with data.frame() containing a column "observed" with the value 1 and a column "predicted" with the extracted values. We then extract the prediction raster at the background points in the same way and build a second dataframe, this time with the value 0 in "observed" and the extracted background values in "predicted". Because the background data set usually contains many more entries than the test data, we draw a random sample from the background dataframe with slice_sample() from the dplyr package, so that both dataframes have the same number of rows. After merging the two dataframes with rbind(), we can calculate the MAE and the AUC.

We evaluate the model with multiple parameters as it allows for a better assessment of how well the model performed.

for (iiiii in 1:138) {
  
# add leading zeros to iiiii so the renamed files can be read
  iiiii_padded <- formatC(iiiii, width = 3, flag = "0")

# load test data for each species
  testData=sf::read_sf(paste0("Data/samples/butterflies_testData_", iiiii_padded,
                              ".gpkg"))
# load the Output raster asc from the spatialMaxent program for each species
  inputdir <- paste0("Output/butterflies_trainData_", iiiii_padded, "/final/") 
  filename6 <- list.files(path=inputdir, pattern = ".+\\.raster\\.asc$", full.names = TRUE)

  file_raster <- raster(filename6)

# calculate the boyce-index:
  boyceIndex=ecospat::ecospat.boyce(fit=file_raster, obs=testData)
  boyceIndex=boyceIndex$cor #Store column cor in boyceIndex

  butterflies_boyce[iiiii,"core"] <- boyceIndex #write to previously created table
  
# extract the values of the test data from the prediction raster for AUC and MAE
  extr_test = raster::extract(file_raster, testData)
  testData$extr<-extr_test #save in column extr in the test data
  
# new dataframe containing columns "observed" with value 1 
  #and "predicted" with extracted values from extr_test
  extr_test = data.frame(observed = 1, predicted=extr_test)
      
# load background data and extract
  extr_bg = raster::extract(file_raster,bg_eval)
  
# new dataframe containing columns "observed" with value 0 
  #and "predicted" with extracted values from extr_bg
  extr_bg = data.frame(observed = 0, predicted=extr_bg)
  
# setting both dataframes on the same number of rows
  extr_bg = extr_bg %>% dplyr::slice_sample(n = nrow(extr_test))
  

 # bind both dataframes together
  df = rbind(extr_bg, extr_test); rm(extr_bg, extr_test)
  df = na.omit(df)
    
    
  # calculate MAE
  MAE = Metrics::mae(df$observed, df$predicted)

  # calculate AUC
  AUC = Metrics::auc(df$observed, df$predicted)

  # save in dataframe butterflies_boyce
  butterflies_boyce[iiiii, "MAE"] <- MAE
  butterflies_boyce[iiiii, "AUC"] <- AUC
}

Now we remove the column “geometry” and write out the evaluation.

butterflies_boyce$geometry <- NULL #remove column geometry
filename7<- paste0("Data/evaluation.csv") #save butterflies_boyce 
write.csv(butterflies_boyce, filename7)

To assess the overall performance of our model for all species, we calculate the median for each evaluation tool.

evaluation <- read.csv("Data/evaluation.csv") #load evaluation data
# calculate the median of the boyceIndex
median_boyce <- median(evaluation$core)
print(median_boyce)
## [1] 0.596
# calculate the median of the MAE
median_MAE <- median(evaluation$MAE)
print(median_MAE)
## [1] 0.2793286
# calculate the median of the AUC
median_AUC <- median(evaluation$AUC)
print(median_AUC)
## [1] 0.9356921

In this table you can look up our evaluation value for each species.

datatable(evaluation, options = list(scrollX=T, scrollY="500px"))

It is strange that we have some negative values for the Boyce Index, which indicate a negative correlation between the test points and the prediction of our model. We will come back to this later. The Boyce Index also produces a graphic for each species. The following illustration shows the example of Phalanta phalantha, which has a relatively low Boyce Index of 0.4. The x-axis shows the habitat suitability predicted by our model, and the points are the presence points of the species. If the model works well, the points should rise steeply towards high habitat suitability. Most points lie in this area, as they should, but there are also some outliers in the range of very low probability of occurrence, and we do not see a consistently rising curve. This shows that our model does not work very well for Phalanta phalantha.

13: Boyce index species 98 Phalanta phalantha

If we look at the best Boyce Index of 0.934, for species 10 Ariadne merione, we can see better what the Boyce Index graphic should look like. The result is still not perfect, because of the gaps and the deviation from the rising trend at the end, but it is better. One reason could be the low amount of data for our butterflies.

14: Boyce Index species 10 Ariadne merione

CONCLUSION

Our model shows a strong correlation overall, as indicated by the Boyce Index of 0.596, the AUC of 0.936 and the MAE of 0.279. So our model did work, but not for all species, as you can see for example for species 111 Plebejus pheretiades with a Boyce Index of -0.69. To find out the reason for this, we look at the data in QGIS, shown below.

15: distribution 111 Plebejus pheretiades

16: closer look at test points of species 111 Plebejus pheretiades

You can see an extreme cluster in which only test data occur. Such spatial separation is exactly the aim of spatialMaxent, but in this case it could be the reason for the bad result. We do not know for sure whether all of the test data lie in one fold, or whether it was a coincidence that several test folds ended up close together; in the latter case the unfavorable distribution of the blocks could explain the negative result. Another reason could be the low number of presence points in each block, which we could not change, because otherwise we could not have run our model. Poor environmental information could be a further reason, as could the low overall number of presence points: if there were more clusters, the model could use some clusters to predict the others. To see whether the number of presence points per species is the reason, we created a graphic showing whether the number of presence points and the Boyce Index correlate. The graphic and the Spearman correlation coefficient show that there is no correlation between these two variables.

17: link between Boyce Index and number of presence points per species

cor.test(evaluation$n, evaluation$core, alternative = "greater", method = "spearman")
## Warning in cor.test.default(evaluation$n, evaluation$core, alternative =
## "greater", : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  evaluation$n and evaluation$core
## S = 419395, p-value = 0.3105
## alternative hypothesis: true rho is greater than 0
## sample estimates:
##        rho 
## 0.04245311

With a random cross-validation, some of these points could have ended up in the training data and the model might have performed better. Ideally, the model should have predicted this cluster, but it did not, for the reasons above. In conclusion, you can say that in this case MaxEnt would have worked better, but all in all we have a good result.

Despite this, you should pay attention to the quality of the environmental information and make sure you have enough presence points.

REFERENCES

Bothe, D. (2021): Tutorial MaxEnt. https://geomoer.github.io/moer-bsc-project-seminar-SDM/unit99/student_tutorials-04b_maxent_Bothe.html (accessed: 12.06.2023)

Jiménez-Valverde, A. & J. M. Lobo (2007): Threshold criteria for conversion of probability of species presence to either–or presence–absence. Acta Oecologica. 31. 361-369.

Marburg University (2023): Species Distribution Modelling. https://geomoer.github.io/moer-bsc-project-seminar-SDM/unit05/unit05-06_assessment.html (accessed 23.06.2023)

Meyer, H., C. Reudenbach, S. Wöllauer & T. Nauss (2019): Importance of spatial predictor variable selection in machine learning applications – Moving from data reproduction to spatial prediction. Ecological Modelling. 411.

nature4.0 (2022): spatialMaxent. https://nature40.github.io/spatialMaxentPaper/docs/020_the_original/ (accessed: 12.06.2023)

Pearson, R. (2014): Species Distribution Modelling Training Course. https://www.youtube.com/watch?v=obuMW5NAtJE&list=PLKYTvTbXFuChaoF-L-1e9RzCagdLPQcCU (accessed: 8.06.2023)

Phillips, S. J. (2017): A Brief Tutorial on Maxent. http://biodiversityinformatics.amnh.org/open_source/maxent/ (accessed: 11.06.2023)

Phillips, S. J. & M. Dudík (2008): Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography. 31/2. 161-175.

Phillips, S. J., R. P. Anderson & R. E. Schapire (2006): Maximum entropy modeling of species geographic distributions. Ecological Modelling. 190/3-4. 231–259.

Wikipedia (2023): Entropy (information theory). https://en.wikipedia.org/wiki/Entropy_(information_theory) (accessed: 21.06.2023)

Willmott, C. J. & K. Matsuura (2005): Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research. 30. 79-82.

1: Sani, T. (2009): Schmetterlingsbild https://live.staticflickr.com/2607/3906375512_d64ce03b6a_b.jpg (accessed:13.06.2023)

sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] mapview_2.11.0     DT_0.28            RColorBrewer_1.1-3 blockCV_3.1-2     
##  [5] dplyr_1.1.2        dismo_1.3-9        raster_3.6-20      sp_1.6-0          
##  [9] geodata_0.5-8      sf_1.0-12          caret_6.0-94       lattice_0.20-45   
## [13] ggplot2_3.4.2      predicts_0.1-8     terra_1.7-29      
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-162            satellite_1.0.4         lubridate_1.9.2        
##  [4] webshot_0.5.5           tools_4.2.3             bslib_0.4.2            
##  [7] rgdal_1.6-5             utf8_1.2.3              R6_2.5.1               
## [10] rpart_4.1.19            KernSmooth_2.23-20      DBI_1.1.3              
## [13] colorspace_2.1-0        nnet_7.3-18             withr_2.5.0            
## [16] tidyselect_1.2.0        leaflet_2.1.2           compiler_4.2.3         
## [19] leafem_0.2.0            cli_3.6.1               sass_0.4.5             
## [22] scales_1.2.1            classInt_0.4-9          proxy_0.4-27           
## [25] stringr_1.5.0           digest_0.6.31           rmarkdown_2.21         
## [28] base64enc_0.1-3         pkgconfig_2.0.3         htmltools_0.5.5        
## [31] parallelly_1.35.0       fastmap_1.1.1           htmlwidgets_1.6.2      
## [34] rlang_1.1.0             rstudioapi_0.14         farver_2.1.1           
## [37] jquerylib_0.1.4         generics_0.1.3          jsonlite_1.8.4         
## [40] crosstalk_1.2.0         ModelMetrics_1.2.2.2    magrittr_2.0.3         
## [43] Matrix_1.5-3            Rcpp_1.0.10             munsell_0.5.0          
## [46] fansi_1.0.4             lifecycle_1.0.3         stringi_1.7.12         
## [49] pROC_1.18.2             yaml_2.3.7              MASS_7.3-58.2          
## [52] plyr_1.8.8              recipes_1.0.6           grid_4.2.3             
## [55] parallel_4.2.3          listenv_0.9.0           splines_4.2.3          
## [58] knitr_1.42              pillar_1.9.0            future.apply_1.10.0    
## [61] reshape2_1.4.4          codetools_0.2-19        stats4_4.2.3           
## [64] glue_1.6.2              evaluate_0.20           leaflet.providers_1.9.0
## [67] data.table_1.14.8       vctrs_0.6.1             png_0.1-8              
## [70] foreach_1.5.2           gtable_0.3.3            purrr_1.0.1            
## [73] future_1.32.0           cachem_1.0.7            xfun_0.38              
## [76] gower_1.0.1             prodlim_2023.03.31      e1071_1.7-13           
## [79] class_7.3-21            survival_3.5-3          timeDate_4022.108      
## [82] tibble_3.2.1            iterators_1.0.14        hardhat_1.3.0          
## [85] units_0.8-2             lava_1.7.2.1            timechange_0.2.0       
## [88] globals_0.16.2          ellipsis_0.3.2          ipred_0.9-14