Tutorial MaxEnt

Introduction

General

In this tutorial, the distribution of several butterfly species across Pakistan will be modeled using a maximum entropy (MaxEnt) approach. From these data, a species richness map will be generated. Additionally, the influence of different environmental parameters and different modeling arguments will be screened.

What is Entropy?

Entropy is a fundamental concept in science and can be seen as the degree of disorder in a system. It was introduced in 1865 by the German physicist Rudolf Clausius.

The second law of thermodynamics states that the entropy of an isolated system (like the universe itself) will always increase over time. It may be possible to decrease the entropy of one compartment (like a biological cell, which is highly ordered and thus possesses a low amount of entropy), but only if the entropy of another system increases to compensate.

This makes entropy one of the few principles in physics that is tightly bound to the arrow of time and cannot be reversed. This, in turn, means that entropy explains why some processes proceed in a certain direction and cannot be reversed. One example is the mixing of two gases by diffusion, which can be simulated with a tool developed by the University of Colorado. (run a diffusion experiment)

In this simulation, two different gases are separated by a wall. If you remove the wall, both gases slowly mix and become perfectly mixed as time passes.

The concept of entropy is basically in our genes: we are used to it, and processes which violate the second law of thermodynamics don't make sense to us. We could also ask the question "why don't the gases spontaneously separate?", but most people will instantly recognize that such a process is not possible. In fact, the only reason for the direction of this process is entropy. Entropy can be seen as a statistical phenomenon: a spontaneous separation of the two gases would require every molecule of each gas to travel to the same side of the container. Since the thermal motion of molecules is undirected, this is extremely unlikely to happen (though in theory still not impossible).
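This improbability is easy to put into numbers. A back-of-the-envelope sketch in R (p_one_side is a made-up helper, not part of the tutorial's analysis): if each molecule independently ends up in either half of the container with probability 1/2, the chance that all N molecules of one gas occupy one chosen side is 0.5^N.

```r
# Probability that all N molecules of one gas end up on one chosen side,
# assuming each molecule independently sits in either half with p = 1/2
p_one_side <- function(N) 0.5^N

p_one_side(10)    # ~0.001 -- still conceivable for 10 molecules
p_one_side(100)   # ~8e-31 -- effectively impossible
```

Already for a few hundred molecules the probability is so small that you would never observe a spontaneous separation, let alone for the ~10^23 molecules in a real container of gas.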

Other processes which are driven by entropy are:
  • Heat is always transferred from the warmer body to the colder one until both bodies have the same temperature (actually the first description of entropy by Clausius)
  • When you drop a ball, it stops bouncing and will not spontaneously start bouncing again after it has stopped
  • A compressed gas will expand and fill the maximum possible space
  • Your room gets untidy on its own, and to clean it up you have to put energy into it

As you can see, entropy is quite an important principle which is everywhere and determines which processes can proceed and which ones cannot. Or as Stephen Hawking put it:

“The increase of disorder or entropy is what distinguishes the past from the future, giving a direction to time.” - Stephen Hawking, A Brief History of Time

Entropy in Information Theory (Shannon Entropy)

Entropy can have slightly different meanings in different scientific fields. In information theory, entropy is a measure of information content. If we have two probability distributions, the one which is most spread out, and thus has the maximum entropy, also contains the most information, as explained in the following video.

The entropy of a probability distribution can be quantified as the Shannon entropy.

(Fletcher & Fortin 2019)
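As a minimal illustration (the helper name shannon is ours, and the logarithm base is left configurable since it is not fixed above), the Shannon entropy H(p) = -Σ p_i·log(p_i) of a distribution can be computed like this:

```r
# Shannon entropy H(p) = -sum(p_i * log(p_i)); p is assumed to sum to 1
shannon <- function(p, base = exp(1)) {
  p <- p[p > 0]                      # convention: 0 * log(0) = 0
  -sum(p * log(p, base = base))
}

shannon(c(0.5, 0.5))   # maximal for two outcomes: log(2) ~ 0.693
shannon(c(0.9, 0.1))   # less spread out -> lower entropy (~0.325)
```

The more evenly the probability mass is spread over the outcomes, the higher the entropy; a distribution concentrated on a single outcome has entropy zero.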

Theory on MaxEnt

Basic Principle

So what does the principle of maximum entropy mean for our species distribution model? The MaxEnt algorithm tries to find the probability distribution with the highest entropy - the one which is most spread out but still satisfies the given constraints.

In the following graphs, two probability distributions are plotted. This could be the probability of a species being present at the geographic coordinates X1, X2 and X3. In this example, the orange distribution has the highest entropy, and if you calculate the Shannon entropy of both distributions according to the above-mentioned equation, you obtain a value of 0.53 for the blue and 0.60 for the orange distribution.

Why MaxEnt?

The reason the algorithm does this is that the MaxEnt model focuses on describing probabilities based on the available data while being cautious about making predictions it cannot know with certainty. For example, we don't know whether a species is distributed unevenly inside a region with suitable environmental parameters because it prefers a very specific temperature over the average temperature, or simply because this region was sampled more frequently while collecting the data.

One major advantage of MaxEnt is that it can use presence-only data. Most of the time, absence data is not available, and when it is, it can be biased and of questionable value. However, presence-only data can also be biased and should therefore be treated with caution.

Presence data can be biased for several reasons:
  • Sampling often occurs on spots which can easily be accessed like roads or rivers
  • Multiple samples can be collected by one researcher from spots which are very close to one another (autocorrelation)
  • Sampling intensity and sampling methods may vary across the study area
  • Geographical data can be inaccurate when using rather old datasets
  • Species misidentification can lead to errors in the dataset

(Phillips 2006)

Mathematics behind MaxEnt

The mathematics of MaxEnt is rather complex and will not be covered in detail. In general, random background points across the study area are sampled to describe the environmental parameters across the whole study area. This is the most even (maximum entropy) distribution (lower part of the figure below). In addition, the environmental parameters at your species locations are sampled (upper part of the figure below).

(Phillips et al. 2011)

After this, the two distributions are matched to one another: the distribution of the presence locations is matched to the spread-out background distribution, thus maximizing its entropy while still satisfying certain constraints. The measure of the difference between two probability distributions is called the Kullback-Leibler divergence, or relative entropy, which can be calculated using the following formula.

(Merow et al. 2013)
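The formula from the figure is not reproduced here; for reference, the standard definition of the Kullback-Leibler divergence between two distributions P and Q is:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{x} P(x)\,\ln\frac{P(x)}{Q(x)}
```

It is zero exactly when P and Q are identical and grows the more the two distributions differ.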

The algorithm behind MaxEnt thus minimizes the Kullback-Leibler divergence to obtain a maximum entropy distribution. At first this sounds strange, because the KL divergence is also called relative entropy, yet minimizing it leads to a maximum entropy distribution. The reason is that the random background distribution already has the maximum entropy, so minimizing the difference between this distribution and our species distribution increases the entropy of the latter.
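A small numeric sketch (with made-up distributions; kl and H are our own helpers) of why this works: for a uniform distribution u over n outcomes, D_KL(p || u) = log(n) − H(p), so the smaller the divergence to the uniform background, the larger the entropy of p.

```r
# Check the identity D_KL(p || u) = log(n) - H(p) numerically
kl <- function(p, q) sum(p * log(p / q))   # Kullback-Leibler divergence
H  <- function(p) -sum(p * log(p))         # Shannon entropy (natural log)

p <- c(0.5, 0.3, 0.2)   # made-up "species" distribution over 3 sites
u <- rep(1/3, 3)        # uniform background = maximum entropy distribution

kl(p, u)                # same value as...
log(3) - H(p)           # ...log(n) minus the entropy of p
```

Since log(n) is a constant, making kl(p, u) smaller is exactly the same as making H(p) larger.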

The actual maths behind MaxEnt is far more complex and will not be covered here, but if you want to know more about the mathematical principles you should take a look at the papers of Merow et al. (2013) and Phillips et al. (2011).

Advantages and Disadvantages

Advantages of MaxEnt include:
  • Requires only presence data
  • Can utilize continuous and categorical data & can incorporate interactions between different variables
  • Efficient algorithms have been developed that are guaranteed to converge to the optimal maximum entropy distribution
  • The model can interpret how much a variable contributes to suitability
  • Over-fitting can be avoided using L1-regularization
  • Can also be applied to presence-absence data when using a conditional model (here we use an unconditional model)
  • Active area of research with numerous papers being published, many applications and improvements
Drawbacks of MaxEnt include:
  • Not as mature (statistically) as GLM or GAM –> fewer guidelines on usage and fewer methods to estimate errors
  • The exponential model can give large values for environmental conditions outside the study area –> careful when extrapolating to other study areas or different climatic conditions –> features outside the value range of the study area should be clamped to set a lower and upper bound

(Phillips 2006)

MaxEnt is quite a popular method with an increasing number of publications, so many new insights are being gained. MaxEnt can also be used for several other applications, for example in analytical chemistry to clean measurement data.

(Morales et al. 2017)

Features in MaxEnt modeling

MaxEnt also offers different features which determine how an environmental variable is treated during modeling. The available features are linear, quadratic, product, threshold, hinge and categorical. By default, all features are enabled (at least when using MaxEnt with the dismo package). At the end of this tutorial, we will also check the influence of those features on our modeling result.

(Fletcher & Fortin 2019)
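As a sketch of how restricting the features might look: dismo's maxent() forwards flags to the underlying MaxEnt Java program through its args parameter. The flag spellings below follow the MaxEnt command-line options; treat them as an assumption to verify against your MaxEnt version, and note that env_df and presence_vec are placeholders for data built later in this tutorial.

```r
# Hypothetical example: allow only linear and quadratic features.
# These flags are passed through to the MaxEnt Java program; by default
# (in dismo) all feature types are enabled automatically.
feature_args <- c("linear=true", "quadratic=true",
                  "product=false", "threshold=false", "hinge=false")

# model <- maxent(x = env_df, p = presence_vec, args = feature_args)
```

Toggling these flags one at a time is a simple way to screen the influence of each feature type on the modeling result.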

Performing MaxEnt modeling

Used packages

#Standard packages for handling (spatial) data
library("raster")
library("ggplot2")
library("dplyr")
library("sp")

#Additionally:
library("dismo") #for SDM techniques
library("plotly") #to create interactive plots
library("reactable") #to prepare an interactive table of a large dataset
library("rgeos") #to crop a spatialPointsDataFrame
library("rmdformats") #for layout of html document
library("utils") #for session info

Obtaining and analyzing data

The first step is to load the observed presence data from a .csv file. This can be done with the read.csv() function. A summary of the dataset shows us there are 5870 entries in this file. The structure of the columns is "species - longitude - latitude", but the column names are garbled, so we set them to the right values using the names() function. Having the right column names is very important (especially later on for the environmental variables t_mean and precip); otherwise the MaxEnt algorithm will not be able to recognize which values to take.

#Get data from CSV file
data <- read.csv("PakistanLadakh.csv")
summary(data) #5870 entries
##   ï..species              x               y        
##  Length:5870        Min.   :62.23   Min.   :24.13  
##  Class :character   1st Qu.:71.90   1st Qu.:33.79  
##  Mode  :character   Median :73.20   Median :34.74  
##                     Mean   :72.71   Mean   :34.27  
##                     3rd Qu.:73.93   3rd Qu.:35.94  
##                     Max.   :77.78   Max.   :37.12
head(data) #structure: species - lon - lat
##       ï..species        x        y
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6  Acraea_violae 72.34951 31.70947
#Get rid of strange column names
names(data) <- c("species", "lon", "lat")
head(data)
##          species      lon      lat
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6  Acraea_violae 72.34951 31.70947

Next, we can take a closer look at the data using the unique() function, which gives us a vector of every species name. length() shows us there are 429 different species in this dataset. Since there are 5870 entries in total, on average 13.7 records per species are present. This is not a lot, and we have to keep this in mind when evaluating our model.

#Retrieve species names
names <- unique(data$species)#Obtain vector of species names
names[1:5] #Take a look at the result
## [1] "Acraea_issoria"       "Acraea_violae"        "Actinor_radians"     
## [4] "Acytolepis_puspa"     "Aeromachus_stigmatus"
length(names) #429 different species
## [1] 429
5870/429 #On average 13.7 records per species
## [1] 13.68298

Finally, we prepare a SpatialPointsDataFrame, setting the CRS to WGS84 using proj4string = CRS("+init=EPSG:4326"), to be able to plot the points properly.

#Prepare spatialPointsDataFrame (spdf)
spdf <- SpatialPointsDataFrame(cbind(data$lon, data$lat), data, proj4string = CRS("+init=EPSG:4326"))

Obtaining climate data and the boundary of Pakistan

Next, we need to obtain the environmental data as well as the border of Pakistan. For the climatic data, we use the getData() function of the raster package to download the bioclim dataset. This dataset consists of 19 layers displaying different variables, with BIO1 being the annual mean temperature and BIO12 the annual precipitation. For the boundary of Pakistan we use the "GADM" attribute of getData().

parkistan <- getData("GADM", country="PK", level=0)
bioclim <- getData("worldclim", var="bio", res=0.5, lon=69.2, lat=29.5)
bioclim2 <- getData("worldclim", var="bio", res=0.5, lon=69.2, lat=33)
#2 bioclim tiles necessary for complete coverage

#Combining both tiles
bioclim_mosaic <- mosaic(bioclim, bioclim2, fun=mean)

#Mask the climatic data to Pakistan
cropped <- mask(bioclim_mosaic, parkistan)

Now we can take a look at the data obtained so far. The map shows that most records are in the north of the country. Since we will use the environmental variables t_mean (BIO1) and precipitation (BIO12) for our MaxEnt model, we prepare a raster stack of those two layers from our masked bioclim dataset (the object cropped).

#Take a look at the result
plot(cropped$layer.1, ext=extent(parkistan), main="Location of Species Records", xlab="Longitude", ylab="Latitude")
plot(parkistan, add=T)
points(spdf, pch="+", cex=0.2, col="red")

#Prepare a stack of values for t_mean (bio1) and precipitation (bio12)
climate <- stack(cropped$layer.1, cropped$layer.12)
names(climate) <- c("t_mean", "precip")

Prepare data for MaxEnt

Preparation of presence data

For our MaxEnt model we need a dataframe of environmental variables at observed species locations and a background of randomly sampled environmental variables across our study area. Since our original data contains records of multiple species in a single dataframe, we need to split it up into multiple subsets, one per species. This can be done with the split() function. We can get a feeling for our new dataset using typeof() and class().

split_data <- split(data, data$species) #Split data in a list of multiple sets, according to data$species
typeof(split_data) #This object is a list
## [1] "list"
class(split_data[[1]]) #The entries of the list are dataframes
## [1] "data.frame"
split_data[1:5] #A look at the first 5 elements gives us a glimpse at what our list looks like (you should not try to print the whole list!)
## $Acraea_issoria
##          species      lon      lat
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 
## $Acraea_violae
##         species      lon      lat
## 6 Acraea_violae 72.34951 31.70947
## 7 Acraea_violae 73.15228 34.63426
## 8 Acraea_violae 73.31037 34.16557
## 
## $Actinor_radians
##            species      lon      lat
## 9  Actinor_radians 71.99009 35.15735
## 10 Actinor_radians 72.07099 35.99492
## 11 Actinor_radians 73.39397 34.45065
## 
## $Acytolepis_puspa
##             species      lon      lat
## 12 Acytolepis_puspa 73.00282 33.85622
## 13 Acytolepis_puspa 73.09919 34.04539
## 14 Acytolepis_puspa 73.13845 33.85907
## 15 Acytolepis_puspa 73.34760 34.06609
## 
## $Aeromachus_stigmatus
##                 species      lon      lat
## 16 Aeromachus_stigmatus 73.16466 33.82338
## 17 Aeromachus_stigmatus 73.16823 34.13390
## 18 Aeromachus_stigmatus 73.34217 34.58814
## 19 Aeromachus_stigmatus 73.41688 34.00422
## 20 Aeromachus_stigmatus 73.51920 34.42539
## 21 Aeromachus_stigmatus 73.52206 34.28405
## 22 Aeromachus_stigmatus 73.62342 34.56244
## 23 Aeromachus_stigmatus 73.64769 33.84480
## 24 Aeromachus_stigmatus 73.83329 33.96139

In the following part we will handle our data entirely in the form of lists. To apply functions to the elements of a list we can use the lapply() function. The inputs of this function are: a list of objects to apply the function to; the function to apply (denoted by FUN=...); and additional arguments needed as input to that function. In this case, these additional arguments are lon and lat, since the next step is to extract environmental conditions at the presence locations. For this we need a dataset containing only two columns (longitude and latitude).

locations <- lapply(split_data, FUN=select, lon, lat) #select lon and lat columns from split_data
locations[1:5] #Take a look at the result
## $Acraea_issoria
##        lon      lat
## 1 73.08635 34.43060
## 2 73.36326 34.24367
## 3 73.36758 34.14221
## 4 73.82637 33.84266
## 5 74.19839 33.94852
## 
## $Acraea_violae
##        lon      lat
## 6 72.34951 31.70947
## 7 73.15228 34.63426
## 8 73.31037 34.16557
## 
## $Actinor_radians
##         lon      lat
## 9  71.99009 35.15735
## 10 72.07099 35.99492
## 11 73.39397 34.45065
## 
## $Acytolepis_puspa
##         lon      lat
## 12 73.00282 33.85622
## 13 73.09919 34.04539
## 14 73.13845 33.85907
## 15 73.34760 34.06609
## 
## $Aeromachus_stigmatus
##         lon      lat
## 16 73.16466 33.82338
## 17 73.16823 34.13390
## 18 73.34217 34.58814
## 19 73.41688 34.00422
## 20 73.51920 34.42539
## 21 73.52206 34.28405
## 22 73.62342 34.56244
## 23 73.64769 33.84480
## 24 73.83329 33.96139

Next, we will use this list of locations to extract the environmental parameters at our presence locations, using extract() combined with lapply(). Note that specifying extract with raster:: is necessary because plotly, which we will use later, depends on the package tidyr, which also has an extract() function. Otherwise you will get an error.

locations_env <- lapply(X=locations, FUN=raster::extract, x=climate) #Args X and x important --> will not work otherwise
locations_env[1:5] #Take a look at the result
## $Acraea_issoria
##      t_mean precip
## [1,]    177   1337
## [2,]    111   1305
## [3,]    157    979
## [4,]    146    548
## [5,]    101    995
## 
## $Acraea_violae
##      t_mean precip
## [1,]    244    305
## [2,]    132   1095
## [3,]    156   1049
## 
## $Actinor_radians
##      t_mean precip
## [1,]    144   1000
## [2,]    -17    662
## [3,]    161   1092
## 
## $Acytolepis_puspa
##      t_mean precip
## [1,]    204    973
## [2,]    196   1183
## [3,]    183   1062
## [4,]    144   1076
## 
## $Aeromachus_stigmatus
##       t_mean precip
##  [1,]    186   1036
##  [2,]    149   1178
##  [3,]    168   1189
##  [4,]    143   1013
##  [5,]    152    937
##  [6,]    168    979
##  [7,]     31    855
##  [8,]    158    553
##  [9,]    169    513

In the last lapply() call, denoting the arguments with X and x is very important. Although it looks strange, the capital X is lapply()'s own argument, while the lowercase x defines the input for the extract() function. If you don't pay attention to this, you will get the following error: "Error in (function (classes, fdef, mtable): unable to find an inherited method for function 'extract' for signature '"data.frame", "RasterStack"'"

So if you ever encounter error messages like this with lapply(), the reason could be something like this.
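The behaviour can be reproduced with a toy function (pow below is made up for illustration): lapply() keeps the argument named X for itself and forwards every other named argument, such as x, to FUN.

```r
# 'X' belongs to lapply() itself; other named arguments go to FUN.
pow <- function(exponent, x) x^exponent   # toy function with its own 'x'

lapply(X = list(1, 2, 3), FUN = pow, x = 10)
# Each list element is passed as the first free argument ('exponent'),
# while x = 10 is matched by name -> 10, 100, 1000
```

Had we written X = 10 instead, lapply() would have treated 10 as the list to iterate over, which is exactly the kind of mix-up that produced the extract() error above.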

Generating Background data

For MaxEnt you will need a background sample, i.e. a random sample of environmental variables across your study area. You can generate it with the sampleRandom() function, which extracts variables from our raster stack "climate" at random points. The argument sp=T means it returns a SpatialPointsDataFrame. Note that the number set in the call does not equal the number of points which are actually sampled: setting the value to 100 yields only around 20 samples (this varies every time). The probable reason is that we sample from a raster which has been masked to Pakistan (everything outside the border is NA) and is not rectangular. Keep this in mind when deciding how many sample points you want. Since our number of species records is rather low, we also choose a rather low value. In this case, a value of 3000 was chosen, resulting in ~500 actual samples. With a larger dataset, background samples of n = 10,000 are common.

bg <- sampleRandom(x=climate, 100, sp=T)
bg #only ~20 sample points are returned
## class       : SpatialPointsDataFrame 
## features    : 13 
## extent      : 66.0125, 74.7375, 26.49583, 36.4375  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 2
## names       : t_mean, precip 
## min values  :    -43,    103 
## max values  :    269,    374
bg <- sampleRandom(x=climate, 3000, sp=T)
bg #around 500 sample locations this time
## class       : SpatialPointsDataFrame 
## features    : 562 
## extent      : 61.57083, 77.37083, 24.00417, 36.85417  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 2
## names       : t_mean, precip 
## min values  :   -143,     46 
## max values  :    276,   1148
#Plot t_mean with bg points to get a feeling for the data and presence locations in red for comparison
plot(climate$t_mean, ext=extent(parkistan), main="Presence and background locations")
points(bg, pch="+", cex=0.5)
points(spdf, pch="+", cex=0.2, col="red") #Plot spatialPointsDataFrame (spdf) of our species records

The maxent(x, p) function works as follows: x is a data frame containing the environmental variables at both presence and background locations. p (for "presence") is a vector consisting of 1 where a species is present and 0 for a background sample. In this way, the rows of x are assigned to presence/background locations.

Thus, for our MaxEnt model to run, we need to combine the presence locations with our background. This can be done using rbind() with lapply(). Note that this appends the bg values after the presence locations in locations_env_bg, which means the first rows are presence data and the last ones are background data.

bg_env <- extract(climate, bg) #Extract environmental data at background locations
head(bg_env) #take a look at result
##      t_mean precip
## [1,]    253    153
## [2,]    234    432
## [3,]    265    139
## [4,]    241    253
## [5,]    224    121
## [6,]    267    145
locations_env_bg <- lapply(X=locations_env, FUN=rbind, bg_env) #Attach environmental data of background samples AT THE END of each element of the list locations_env

Next, we need to generate a vector which tells the algorithm whether the entries in locations_env_bg are presence (1) or background (0) locations. We can do this with three nested lapply() calls:

  • The first takes locations_env and executes nrow() on each element, returning a list with the number of rows of each element of the input list (do you understand this sentence?)
  • The second lapply() takes this output list and executes rep() on each element. x=1 denotes that 1 is the number you want to repeat; X="previous lapply output" means that you want to repeat 1 as often as the value of each element of X. Here, too, denoting the arguments with x and X is important: mind the difference between a capital X and a lowercase x or you will run into trouble.
  • The last lapply() takes the resulting list of vectors of the kind c(1,1,1) and combines each of them (FUN=c) with a vector of zeros of length nrow(bg). This vector (bg_0) is created beforehand with the rep() function (it stays the same for every element of the list we attach it to).

Printing the first element of the final output shows that we obtained the kind of data we wanted (ones at the beginning, and zeros, which come from the bg data, at the end).

bg_0 <- rep(0, nrow(bg)) #Returns a vector of zeros with the length of bg

p_bg <- lapply(X=lapply(                             #Third lapply -> X=output of second one
                        FUN=rep, x=1, X=lapply(      #Second lapply -> X=output of first one
                        X=locations_env, FUN=nrow)), #First lapply!
                        FUN=c, bg_0)                 #Belongs to third lapply

p_bg[1] #A lot of ones and zeros
## $Acraea_issoria
##   [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0
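For comparison, the same structure can be built with a single anonymous function instead of three nested lapply() calls. A toy sketch, with made-up matrices standing in for locations_env and a short zero vector standing in for bg_0:

```r
# Toy stand-ins: a list of presence matrices and a background zero vector
locations_env_toy <- list(Species_a = matrix(1, nrow = 2, ncol = 2),
                          Species_b = matrix(1, nrow = 3, ncol = 2))
bg_0_toy <- rep(0, 4)

# One anonymous function: ones for each presence row, zeros appended
p_bg_toy <- lapply(locations_env_toy,
                   function(m) c(rep(1, nrow(m)), bg_0_toy))

p_bg_toy$Species_a   # 1 1 0 0 0 0
```

Both approaches give the same result; the anonymous function avoids the X/x argument-matching pitfalls discussed earlier.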

Run MaxEnt

With this data we can finally run our MaxEnt model. This time we have to use mapply() instead of lapply() because we have two lists whose elements we want to pair up: the first is locations_env_bg, our list of environmental parameters at presence and bg locations, and the second is p_bg, our list of vectors of 1s and 0s marking whether a row is a presence or a bg location. Since maxent() requires a data frame, we first need to convert locations_env_bg using lapply() and as.data.frame(). Running the code below works at first, but unfortunately, after a couple of entries, it throws an error saying that there are a different number of rows. For educational purposes I decided to include my bugfixing in the tutorial and show how I overcame this problem and made it work.

locations_env_bg_df <- lapply(locations_env_bg, FUN=as.data.frame) #next function requires data frame
maxent_butterflies <- mapply(FUN=maxent, locations_env_bg_df, p=p_bg) #Function stops with an error after several runs

First, I wanted to find out whether a single entry works. Indeed, the first entry of each list runs without problems.

maxent_butterflies_1 <- maxent(locations_env_bg_df[[1]], p=p_bg[[1]]) #Single entry works well
## This is MaxEnt version 3.4.1
maxent_butterflies_1
## class    : MaxEnt 
## variables: t_mean precip

Then I checked whether the number of rows of each element of locations_env_bg_df equals the length of the corresponding presence/background vector in p_bg. This can be done by applying functions to our two lists: nrow() returns the number of rows of each element of the first list, length() the length of each vector in the second, and mapply() with identical() then checks whether each pair matches. Finally, which(equality==FALSE) returns every element where the two lists do not match, and length(equality_false) shows that no entry has a different number of rows. This is double-checked by length(which(equality==TRUE)), which shows that 429 elements have matching sizes, the expected number.

#Is nrow() of each element of locations_env_bg_df equal to the length of the corresponding element of p_bg?
nrow1 <- lapply(locations_env_bg_df, FUN=nrow)
nrow2 <- lapply(p_bg, FUN=length)

equality <- mapply(nrow1, nrow2, FUN=identical)

equality_false <- which(equality==FALSE)
length(equality_false) #No entry is FALSE
## [1] 0
length(which(equality==TRUE)) #Double check if function works properly -> 429 entries are correct
## [1] 429

Since the number of rows of each element is equal (which we would expect, because p_bg was generated from locations_env_bg and therefore must match), it is strange that we get an error when running MaxEnt. Next, we can check whether mapply() works in general with a reduced number of list elements and then identify a list element which does not work. Indeed, entries 251:268 work while 269 does not, and 271:272 work while 273 does not.

maxent_butterflies_251_268 <- mapply(FUN=maxent, locations_env_bg_df[251:268], p=p_bg[251:268])
length(maxent_butterflies_251_268) #251-268 work

maxent_butterflies_269 <- maxent(locations_env_bg_df[[269]], p=p_bg[[269]]) #269 does not work

maxent_butterflies_271_272 <- mapply(FUN=maxent, locations_env_bg_df[271:272], p=p_bg[271:272])
length(maxent_butterflies_271_272) #271-272 work

maxent_butterflies_273 <- maxent(locations_env_bg_df[[273]], p=p_bg[[273]]) #273 does not work

So next we can figure out what is wrong with entries 269 and 273. Looking at the values of locations_env_bg_df[[269]], we find that it contains an NA value. The same applies to entry 273. These NA values are the likely cause of the error.

head(locations_env_bg_df[[269]]) #contains NA value
##   t_mean precip
## 1     NA     NA
## 2    253    153
## 3    234    432
## 4    265    139
## 5    241    253
## 6    224    121
head(locations_env_bg_df[[273]]) # also contains NA value
##   t_mean precip
## 1     NA     NA
## 2    253    153
## 3    234    432
## 4    265    139
## 5    241    253
## 6    224    121

But why do we have NA values? We can take a look at one species which does not work, P. afghana. When plotting the species locations, zoomed in to the extent ext=c(70, 72, 35, 37), we see that one point lies outside of our study area, and this leads to the NA values.

P_afghana <- filter(data, species=="Paralasa_afghana")

#Zoom in to the point
plot(climate$t_mean, ext=c(70, 72, 35, 37))
points(cbind(P_afghana$lon, P_afghana$lat), pch="+", col="red")

#Point is out of range

So the problem was that we used a masked version of our climate data, in which some points actually fall slightly out of range (outside the border, where the raster is NA). So if you encounter strange error messages like this in MaxEnt, this could be the reason.

Thus we need to do everything again, but this time we use crop() on our environmental raster, which keeps the full rectangular extent, to ensure that every point is in range.

cropped_2 <- crop(bioclim_mosaic, parkistan)

plot(cropped_2$layer.1, ext=extent(parkistan), main="Location of Species Records", xlab="Longitude", ylab="Latitude")
plot(parkistan, add=T)
points(spdf, pch="+", cex=0.2, col="red")

climate_2 <- stack(cropped_2$layer.1, cropped_2$layer.12)
names(climate_2) <- c("t_mean", "precip")

#Some steps are missing which we don't need to do again (see previous sections)
locations_env_2 <- lapply(X=locations, FUN=extract, x=climate_2)

bg_2 <- sampleRandom(x=climate_2, 5000, sp=T)
#Need new bg sample though because we have a larger extent of env data now, number of points increased from 3000 to 5000!

#Plot t_mean with bg points to get a feeling for the data and presence locations in red for comparison
plot(climate_2$t_mean, ext=extent(parkistan))
plot(parkistan, add=T)
points(bg_2, pch="+", cex=0.5)
points(spdf, pch="+", cex=0.2, col="red") #Plot spatialPointsDataFrame (spdf) of our species records

bg_env_2 <- extract(climate_2, bg_2) #Extract environmental data at background locations

locations_env_bg_2 <- lapply(X=locations_env_2, FUN=rbind, bg_env_2) #combine presence data with bg samples

bg_0_2 <- rep(0, nrow(bg_2)) #Returns a vector of zeros with the length of bg

p_bg_2 <- lapply(X=lapply(                             #Third lapply -> X=output of second one
                        FUN=rep, x=1, X=lapply(      #Second lapply -> X=output of first one
                        X=locations_env_2, FUN=nrow)), #First lapply!
                        FUN=c, bg_0_2)                 #Belongs to third lapply; bg_0 -> vector of zeros

locations_env_bg_df_2 <- lapply(locations_env_bg_2, FUN=as.data.frame) #next function requires data frame
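The triple-nested lapply() above is admittedly hard to read. Its logic can be checked on a small toy example (invented species matrices, not our real data): the inner lapply() counts presence rows, the middle one turns each count into a vector of ones, and the outer one appends the background zeros.

```r
# Toy version of the nested lapply: two fake species with 2 and 3 presence
# rows and a background sample of 4 points
toy_env <- list(sp_A = matrix(0, nrow = 2, ncol = 2),
                sp_B = matrix(0, nrow = 3, ncol = 2))
toy_bg0 <- rep(0, 4)                               # zeros for the bg points

toy_p_bg <- lapply(X = lapply(                     # third lapply: append zeros
                     FUN = rep, x = 1, X = lapply( # second lapply: one 1 per row
                     X = toy_env, FUN = nrow)),    # first lapply: count rows
                   FUN = c, toy_bg0)

toy_p_bg$sp_A  # 1 1 0 0 0 0
toy_p_bg$sp_B  # 1 1 1 0 0 0 0
```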

With our new data, our MaxEnt modeling should work now.

maxent_butterflies <- mapply(FUN=maxent, locations_env_bg_df_2, p=p_bg_2) #This should work now

Checking the length of our list of MaxEnt results, we see 429 entries, which is the correct number, so everything worked this time.

length(maxent_butterflies) #output = 429 entries -> correct
## [1] 429

Prediction of species distributions

The next step is the prediction of butterfly distributions using our MaxEnt models. This will take a while and your computer might struggle, so to guard against a crash you can save your environment using save.image(). Then, if your computer crashes, you can continue right where you left off without having to run everything again.

save.image(file="environment.RData") #Save Environment in case of crash
load("environment.RData") #To load environment

For the final prediction, we can use predict() inside a for loop. lapply() cannot be used here because it would consume massive amounts of RAM and crash your computer. With a for loop, we can predict each element individually and write the result to a .tif file so that our progress is saved. Attention: running predict() over all models will take quite a while, so this is a good time for a long break.

setwd("./prediction") #first create subfolder for .tif files

#Predict with a for loop - this will take some time
for(i in seq_along(maxent_butterflies)){
  output <- predict(maxent_butterflies[[i]], climate)
  writeRaster(output, filename=paste0(names(maxent_butterflies[i]), ".tif"))
}

After predict() has finished, you can load the generated .tif files: list.files() generates a vector of file names, and lapply()-ing the function raster() over this vector loads each previously generated raster.

setwd("./prediction")
output_list <- list.files(pattern='\\.tif$', all.files=TRUE, full.names=FALSE) #Create vector of all .tif file names

length(output_list) #Check result -> should be 429
## [1] 429
rasters <- lapply(output_list, FUN=raster) #Load every .tif file with the function raster()
raster_stack <- stack(rasters) #Generate a raster stack from the list "rasters"

To get a feeling for our data, we can plot every element using par(mfrow=c(2,4)) so that several maps are shown next to each other to save space. As you can see, results vary. Some predictions look quite good, but some maps are strange and don’t really show a distribution but rather a uniform value all over the map. This is most likely due to a low number of samples which the model can’t handle. From now on, these examples will be referred to as “strange maps” and we will check whether we can avoid them by using different MaxEnt parameters.

par(mfrow=c(2,4))
for(i in seq_along(as.list(raster_stack))){
  plot(as.list(raster_stack)[[i]], ext=extent(parkistan), main=names(as.list(raster_stack)[[i]]))
  points(as.list(split_data)[[i]]$lon, as.list(split_data)[[i]]$lat, pch="+", cex=0.8)
}

Generating a species richness map

The next step is the generation of the final species richness map using sum(). Using sum() on the whole stack did not work for me because it filled my temp folder with almost 40 GB, so I decided to run the sum in two parts: first, the first two layers of the stack are summed and the remaining layers up to layer 300 are added one by one. The same is done for layers 301-429, and finally the two partial sums richness_1 and richness_2 are added to obtain the final species richness map richness.

richness_1 <- sum(raster_stack[[1]], raster_stack[[2]])
for(i in 3:300){
  richness_1 <- sum(richness_1, raster_stack[[i]])
  print(paste("Element", i, "has been processed."))
}
writeRaster(richness_1, filename="richness_1.tif")


richness_2 <- sum(raster_stack[[301]], raster_stack[[302]])
for(i in 303:nlayers(raster_stack)){
  richness_2 <- sum(richness_2, raster_stack[[i]])
  print(paste("Element", i, "has been processed."))
}
writeRaster(richness_2, filename="richness_2.tif")

richness <- sum(richness_1, richness_2)
writeRaster(richness, filename="richness_map.tif")
plot(richness, ext=extent(parkistan), main="Species richness of butterflies in Parkistan")
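Summing in chunks and then adding the partial sums gives exactly the same result as one big sum; since raster layers are summed cell-wise, the logic can be sanity-checked with plain matrices (toy data, not our rasters):

```r
# Chunked summation equals the full sum (cell-wise, like raster layers)
set.seed(1)
layers <- lapply(1:10, function(i) matrix(runif(4), nrow = 2))

part_1 <- Reduce(`+`, layers[1:6])   # analogous to richness_1
part_2 <- Reduce(`+`, layers[7:10])  # analogous to richness_2
total  <- Reduce(`+`, layers)        # summing everything at once

all.equal(part_1 + part_2, total)    # TRUE
```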

We can also divide the species richness map by the largest value to obtain a relative species richness.

richness_relative <- richness/maxValue(richness)
plot(richness_relative, ext=extent(parkistan), main="Relative species richness of butterflies in Parkistan")

Optimizing the MaxEnt method

After we finished our actual goal - generating a species richness map - we can take a look at how different conditions influence our result and whether we can optimize the model.

Taking a look at MaxEnt results

To further assess our method, we can investigate how differences in input data or MaxEnt parameters influence our result. For a detailed summary of a single model, maxent() objects offer the @results slot:

maxent_butterflies[[1]]@results
##                                                                                         [,1]
## X.Training.samples                                                                    5.0000
## Regularized.training.gain                                                             0.7305
## Unregularized.training.gain                                                           1.0612
## Iterations                                                                           60.0000
## Training.AUC                                                                          0.9369
## X.Background.points                                                                5005.0000
## precip.contribution                                                                  94.8416
## t_mean.contribution                                                                   5.1584
## precip.permutation.importance                                                       100.0000
## t_mean.permutation.importance                                                         0.0000
## Entropy                                                                               7.7877
## Prevalence..average.probability.of.presence.over.background.sites.                    0.2953
## Fixed.cumulative.value.1.cumulative.threshold                                         1.0000
## Fixed.cumulative.value.1.Cloglog.threshold                                            0.1362
## Fixed.cumulative.value.1.area                                                         0.9664
## Fixed.cumulative.value.1.training.omission                                            0.0000
## Fixed.cumulative.value.5.cumulative.threshold                                         5.0000
## Fixed.cumulative.value.5.Cloglog.threshold                                            0.1508
## Fixed.cumulative.value.5.area                                                         0.8422
## Fixed.cumulative.value.5.training.omission                                            0.0000
## Fixed.cumulative.value.10.cumulative.threshold                                       10.0000
## Fixed.cumulative.value.10.Cloglog.threshold                                           0.1697
## Fixed.cumulative.value.10.area                                                        0.7041
## Fixed.cumulative.value.10.training.omission                                           0.0000
## Minimum.training.presence.cumulative.threshold                                       43.4822
## Minimum.training.presence.Cloglog.threshold                                           0.3995
## Minimum.training.presence.area                                                        0.1870
## Minimum.training.presence.training.omission                                           0.0000
## X10.percentile.training.presence.cumulative.threshold                                43.4822
## X10.percentile.training.presence.Cloglog.threshold                                    0.3995
## X10.percentile.training.presence.area                                                 0.1870
## X10.percentile.training.presence.training.omission                                    0.0000
## Equal.training.sensitivity.and.specificity.cumulative.threshold                      43.5033
## Equal.training.sensitivity.and.specificity.Cloglog.threshold                          0.3995
## Equal.training.sensitivity.and.specificity.area                                       0.1868
## Equal.training.sensitivity.and.specificity.training.omission                          0.2000
## Maximum.training.sensitivity.plus.specificity.cumulative.threshold                   43.4822
## Maximum.training.sensitivity.plus.specificity.Cloglog.threshold                       0.3995
## Maximum.training.sensitivity.plus.specificity.area                                    0.1870
## Maximum.training.sensitivity.plus.specificity.training.omission                       0.0000
## Balance.training.omission..predicted.area.and.threshold.value.cumulative.threshold   10.8706
## Balance.training.omission..predicted.area.and.threshold.value.Cloglog.threshold       0.1752
## Balance.training.omission..predicted.area.and.threshold.value.area                    0.6821
## Balance.training.omission..predicted.area.and.threshold.value.training.omission       0.0000
## Equate.entropy.of.thresholded.and.original.distributions.cumulative.threshold        20.7352
## Equate.entropy.of.thresholded.and.original.distributions.Cloglog.threshold            0.2479
## Equate.entropy.of.thresholded.and.original.distributions.area                         0.4815
## Equate.entropy.of.thresholded.and.original.distributions.training.omission            0.0000

The most interesting elements of maxent_butterflies@results include the number of training samples, the number of iterations, the AUC (the area under the curve from the model evaluation) and the contributions of our two environmental variables, i.e. how much each variable contributed to the MaxEnt model. These parameters are entries 1, 4, 5, 7 and 8 of the @results output. To collect them for all models, we can use a for loop to write the interesting values into a data frame.

##CREATE TABLE OF RESULTS
results_df <- data.frame(species=NA, training_samples=NA, iterations=NA, training_AUC=NA, precip_contr.=NA, t_mean_contr.=NA)

for(i in seq_along(maxent_butterflies)){
  results_df[i,1] <- names(maxent_butterflies[i])
  for(j in c(1,4,5,7,8)){
    if(j==1){n=2}
    if(j==4){n=3}
    if(j==5){n=4}
    if(j==7){n=5}
    if(j==8){n=6}
    results_df[i,n] <- maxent_butterflies[[i]]@results[j]
  }
}
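The if-chain above simply maps the @results rows c(1, 4, 5, 7, 8) onto columns 2 to 6. As an aside, the same mapping can be written as a single indexing step; here sketched with an invented results vector (the values are made up, not real MaxEnt output):

```r
# Invented stand-in for maxent_butterflies[[i]]@results (values are made up)
toy_results <- c(5, 0.73, 1.06, 60, 0.94, 5005, 94.8, 5.2)

wanted   <- c(1, 4, 5, 7, 8)       # rows of @results we are interested in
row_vals <- toy_results[wanted]    # one indexing call replaces the five ifs

toy_df <- data.frame(species          = "toy_species",
                     training_samples = row_vals[1],
                     iterations       = row_vals[2],
                     training_AUC     = row_vals[3],
                     precip_contr.    = row_vals[4],
                     t_mean_contr.    = row_vals[5])
toy_df
```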

Now results_df contains every parameter we are interested in. I use the package reactable to get an interactive table in the final R Markdown document.

reactable(results_df)

Interactive visualisation with Plotly

Since this is quite a large table, we can also generate an interactive 3D plot using the package plotly. With this package, you can create beautiful interactive plots (for example, you can zoom in or hover over data points to see the exact x, y and z values). Plotly is a very nice package and you should definitely look into it!

In our plot, we would also like to use a different symbol for strange maps. As can be seen from results_df, strange maps have 0% contribution from both precipitation and t_mean, which does not make sense. To better investigate these strange maps, we create a new data frame df_new from results_df with an additional column containing the sum of precip_contr. and t_mean_contr.: if this value is 0%, it is a strange map; if it is 100%, it is not.

#To be able to choose another symbol for strange spots
df_new <- mutate(results_df, t_mean_prec=results_df$precip_contr.+results_df$t_mean_contr.)
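The 0% vs. 100% logic can be illustrated with two invented rows (made-up contribution values): a normal model's contributions sum to 100%, a strange map's to 0%.

```r
# Made-up contributions: first row a normal model, second row a "strange map"
toy <- data.frame(precip_contr. = c(60, 0),
                  t_mean_contr. = c(40, 0))
toy$t_mean_prec <- toy$precip_contr. + toy$t_mean_contr.

toy$t_mean_prec        # 100 0
toy$t_mean_prec == 0   # FALSE TRUE -> the second row is a strange map
```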

Now we can create our interactive plot using plot_ly(). On the x-axis we plot t_mean_contr. (the difference to 100% is thus precip_contr.), the y-axis shows the number of iterations and the z-axis the number of training samples. Additionally, we color the symbols according to the AUC and use symbol=~t_mean_prec to get different symbols for strange and normal maps: a circle marks a normal map and a cross marks a strange map. Finally, hovertemplate defines the text shown when hovering over a data point in the plot.

fig <- plot_ly(df_new, x = ~t_mean_contr., y = ~iterations, z = ~training_samples,
               marker = list(color = ~training_AUC, size=4, colorscale = "Jet", showscale = TRUE), symbol=~t_mean_prec, symbols=c("x", "circle"), mode="markers", text=~species, hovertemplate=paste("<i><b>%{text}</b></i><br><br>", "<b>Samples:</b> %{z}<br>", "<b>t_mean contrib.:</b> %{x:.1f}%<br>", "<b>Iterations:</b> %{y}<br>", "<b>AUC:</b> %{marker.color}"))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 't_mean contribution'),
                                   yaxis = list(title = 'Iterations'),
                                   zaxis = list(title = 'Training samples')),
                      annotations = list(
                        x = 1.13,
                        y = 1.05,
                        text = 'Training_AUC',
                        xref = 'paper',
                        yref = 'paper',
                        showarrow = FALSE
                        ))
fig
Observations from plot:
  • A relatively large number of species sit at extreme values (0% or 100%) of t_mean contribution.
  • With small sample sizes, the chance of getting a low AUC is higher (which we would expect).
  • Large sample sizes can sometimes yield a low AUC as well.
  • All “strange” maps have a sample size of 1 or 2.
  • Next: a test dataset will be chosen to optimize the method.

The next step will be to choose a test dataset consisting of different species representing different characteristics. With this test dataset we will optimize the method using different parameters and MaxEnt features.

The samples chosen for the test dataset are:
  • Zizeeria karsandra: High sample size but low AUC (0.74)
  • Danaus chrysippus
  • Pseudochazara thelephassa: Low sample size and 100% t_mean contribution.
  • Pontia glauconome: Low AUC and medium t_mean contribution.
  • Hyponephele interposita: Low AUC and high precip contribution.
  • Vanessa atalanta: Very low AUC and 100% precip contribution.
  • Tarucus rosaceus: Produced a strange map.
  • And 5 control samples which produced a good result using MaxEnt default settings
  • Colias erate: Large sample size.
  • Pseudochazara lehana: Medium sample size and almost 100% t_mean contribution.
  • Ypthima avanta: Low sample size and almost 100% precip contribution.
  • Melitaea chitralensis: Produced an AUC of almost 1.
  • Pieris rapae: In the middle.

Optimization screening

Next, we will create a list of our test species:

#Create a list of the environmental variables for each test species
test_data <- list(locations_env_2$Zizeeria_karsandra, locations_env_2$Danaus_chrysippus, locations_env_2$Pseudochazara_thelephassa, locations_env_2$Pontia_glauconome, locations_env_2$Hyponephele_interposita, locations_env_2$Vanessa_atalanta, locations_env_2$Tarucus_rosaceus, locations_env_2$Colias_erate, locations_env_2$Pseudochazara_lehana, locations_env_2$Ypthima_avanta, locations_env_2$Melitaea_chitralensis, locations_env_2$Pieris_rapae)

#create vector of test species names
species_names_test <- c("Zizeeria_karsandra", "Danaus_chrysippus", "Pseudochazara_thelephassa", "Pontia_glauconome", "Hyponephele_interposita", "Vanessa_atalanta", "Tarucus_rosaceus", "Colias_erate", "Pseudochazara_lehana", "Ypthima_avanta", "Melitaea_chitralensis", "Pieris_rapae")

#set names of our list
names(test_data)<- species_names_test

With our test_data we will analyze the influence of the following parameters:

  • Influence of background sampling.
  • Influence of different environmental variables.
  • Influence of different MaxEnt features.

Testing influence of background samples

In the previous MaxEnt run, we took a bg sample from our cropped climate raster, which generates bg samples all over the rectangular(!) area around Parkistan. We did this because some records lie outside the borders of Parkistan. The problem is that our presence points are still very close to Parkistan’s border, while our background samples cover a wide area around Parkistan where no species records are present. This can lead to errors since our bg might not be representative of our presence data. To check this influence, we will crop our bg data to the extent of Parkistan using gIntersection() from the rgeos package.

bg_test <- gIntersection(parkistan, bg_2) #From rgeos package

plot(parkistan)
points(bg_test, pch="+", cex=0.5)

Now we can use the procedure we are already familiar with for our MaxEnt modeling:
  • Extracting environmental parameters from bg locations.
  • Using rbind to combine environmental data of our test data with our bg samples.
  • Creating vector of 1/0 presence/absence.
  • Run MaxEnt.
bg_test_env <- extract(climate, bg_test)
p_bg_env_test <- lapply(test_data, FUN=rbind, bg_test_env) #combine presence and bg environmental data


bg_0_test <- rep(0, nrow(as.data.frame(bg_test)))
pa_test <- lapply(X=lapply(                     #generate presence (1) and absence (0) vectors (see previous section)
                  FUN=rep, x=1, X=lapply(        
                  X=test_data, FUN=nrow)),  
                  FUN=c, bg_0_test)  

p_bg_env_test_df <- lapply(p_bg_env_test, FUN=as.data.frame) #next step requires dataframe
maxent_cropped_bg <- mapply(FUN=maxent, p_bg_env_test_df, p=pa_test)
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1
## Warning in matrix(as.numeric(d)): NAs introduced by coercion
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1

Finally, we can predict species occurrence for our test data using predict():

predict_cropped_bg <- lapply(maxent_cropped_bg, FUN=predict, climate)

The next step is to extract the values from the @results table of our MaxEnt data. Since we might need this step again for further optimization screening, I decided to wrap it in a new function using function(). The arguments in the parentheses after function() are the required input data, in our case our MaxEnt data (as a list!). The procedure is the same as for our last extraction of results (see above), just embedded in a function: first we create an empty data frame with all our column names (species, training_samples, iterations, training_AUC), and additionally we use rownames(maxent_data[[1]]@results)[7] and ...[8] to get the contributions of our environmental variables, which are at positions 7 and 8 of @results (when using two variables!). We also use gsub() to shorten the column names from “contribution” to “contr.”. return() at the end of the function specifies which value the function returns; in our case, this is the data frame of the variables we are interested in.

DanielExtract <- function(maxent_data){
  extraction <- data.frame(NA, NA, NA, NA, NA, NA) #create empty data frame
  colnames(extraction) <- c("species", "training_samples", "iterations", "training_AUC", rownames(maxent_data[[1]]@results)[7], rownames(maxent_data[[1]]@results)[8])
  colnames(extraction) <- gsub(".contribution", "_contr.", colnames(extraction))

  for(i in seq_along(maxent_data)){
    extraction[i,1] <- names(maxent_data[i]) #Writes species name in first column
    for(j in c(1,4,5,7,8)){
      if(j==1){n=2}
      if(j==4){n=3}
      if(j==5){n=4}
      if(j==7){n=5}
      if(j==8){n=6}
      extraction[i,n] <- maxent_data[[i]]@results[j]
    }
  }
  return(extraction)
}

Writing functions like this is very convenient when you have to repeat the same steps over and over again and want to save code. Now, with DanielExtract(), we can extract our values from @results in one simple line instead of a ton of code:

results_cropped_bg <- DanielExtract(maxent_cropped_bg)
reactable(results_cropped_bg)

Now we want to compare our results after optimization with our original data. We could use cbind() to combine the two data frames, but since we have a lot of columns, it would not be easy to compare each entry before and after optimization. It is easier to merge both data frames into a new one with alternating rows from before and after the optimization, so that the values we want to compare sit right next to each other. For this, we can write a function DanielMerge, which takes the rows of each data frame and puts them together in an alternating fashion. We use (2*i-1) and i*2 as row indices of the new data frame, because this yields rows 1 and 2 for i=1, rows 3 and 4 for i=2, and so on.

DanielMerge <- function(df1, df2){
  newdf <- data.frame(species=NA, training_samples=NA, iterations=NA, training_AUC=NA, precip_contr.=NA, t_mean_contr.=NA)
  for(i in 1:nrow(df1)){
    newdf[(2*i-1),] <- df1[i,]
    newdf[(2*i),]   <- df2[i,]
  }
  return(newdf)
}
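The index arithmetic can be verified on its own: row i of df1 goes to position 2*i - 1 (odd rows) and row i of df2 to position 2*i (even rows), so the rows alternate.

```r
# Check the interleaving positions used by DanielMerge for n = 3 rows
n <- 3
pos_df1 <- 2 * (1:n) - 1        # 1 3 5 -> odd positions for df1
pos_df2 <- 2 * (1:n)            # 2 4 6 -> even positions for df2

source_of_row <- integer(2 * n)
source_of_row[pos_df1] <- 1L    # mark rows coming from df1
source_of_row[pos_df2] <- 2L    # mark rows coming from df2
source_of_row                   # 1 2 1 2 1 2 -> perfectly alternating
```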

In order for DanielMerge() to work, we first need a data frame of the original results (before optimization) for our test dataset, and we need to make sure that the species are in the same order in both data frames.

test_names <- c("Zizeeria_karsandra", "Danaus_chrysippus", "Pseudochazara_thelephassa", "Pontia_glauconome", "Hyponephele_interposita", "Vanessa_atalanta", "Tarucus_rosaceus", "Colias_erate", "Pseudochazara_lehana", "Ypthima_avanta", "Melitaea_chitralensis", "Pieris_rapae")

test_data_results <- filter(results_df, species %in% test_names) #Filter large dataset for test data
test_data_results_arranged <- test_data_results[match(test_names, test_data_results$species),] #Arrange df to have the same order as the output of predictions from our test dataset --> two species will be next to each other

Now we can run DanielMerge on our two data frames to compare our results before and after optimization. Note that in our final data frame the first entry of each species is always the parameter before optimization and the second one is after optimization.

#Now run DanielMerge
comparison_cropped_bg <- DanielMerge(test_data_results_arranged, results_cropped_bg)
reactable(comparison_cropped_bg)

Comparison of results with a plot

For a more intuitive comparison, we can also create a plot of our two results so we can visually interpret differences. For this, we can again write functions which automatically create a comparison plot. We will create two functions, DanielPlot and DanielPlot2. The first creates the plot we already built above and is intended for the visualization of one dataset (no comparison!). On the x-axis will be the t_mean contribution, on the y-axis the number of iterations and on the z-axis the number of training samples; the symbol differs between strange maps (a cross) and normal maps (a circle) and the color corresponds to the training AUC. We will not need this function now, but it is good to have at hand if you want to visualize MaxEnt results quickly.

DanielPlot <- function(df){
  t_p <- data.frame(t_mean_prec=(df$precip_contr.+df$t_mean_contr.))
  df_mod <- cbind(df, t_p)
  plot <- plot_ly(df_mod, x = ~t_mean_contr., y = ~iterations, z = ~training_samples,
               marker = list(color = ~training_AUC, size=4, colorscale = "Jet"), symbol=~t_mean_prec, symbols=c("x", "circle"), mode="markers", text=~species, hovertemplate=paste("<i><b>%{text}</b></i><br><br>", "<b>Samples:</b> %{z}<br>", "<b>t_mean contrib.:</b> %{x:.1f}%<br>", "<b>Iterations:</b> %{y}<br>", "<b>AUC:</b> %{marker.color}"))
  plot <- plot %>% add_markers()
  plot <- plot %>% layout(scene = list(xaxis = list(title = 't_mean contribution'),
                                       yaxis = list(title = 'Iterations'),
                                       zaxis = list(title = 'Training samples')))
  return(plot)
}

dp <- DanielPlot(comparison_cropped_bg)
dp

For the comparison of our data before and after optimization, we will create DanielPlot2, which uses two different datasets (merged together with DanielMerge). The visualization differs a little from DanielPlot: the x- and z-axes stay the same, but on the y-axis we now plot the training AUC. Also, the color now indicates the species, which we convert to a numeric value species_indicator so the function can color the points by species. We additionally use b_a_indicator as an indicator of whether the corresponding row is from before (1) or after (2) the optimization. In this way, we can use different symbols for data before (circles) and after (crosses) optimization.

DanielPlot2 <- function(dataf, title){
  dataf_num_spec <- cbind(dataf, data.frame(species_indicator=as.vector(rbind(c(1:(nrow(dataf)/2)), c(1:(nrow(dataf)/2))))), b_a_indicator=rep(c(1,2), (nrow(dataf)/2)))
  
  plot <- plot_ly(dataf_num_spec, 
                  x = ~t_mean_contr., y = ~training_AUC, z = ~training_samples, 
               marker = list(color = ~species_indicator,  size=5,  colorscale = "Jet"),  symbol=~b_a_indicator, symbols=c("circle", "x"), mode="markers", text=~species, hovertemplate=paste("<i><b>%{text}</b></i><br><br>", "<b>Samples:</b> %{z}<br>", "<b>t_mean contrib.:</b> %{x:.1f}%<br>", "<b>Training AUC:</b> %{y:.2f}<br>"))
plot <- plot %>% layout(title=title, scene = list(xaxis = list(title = 't_mean contribution'),
                                   yaxis = list(title = 'Training AUC'),
                                   zaxis = list(title = 'Training samples')))
return(plot)
}
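The two helper columns that DanielPlot2 builds can be inspected on their own for a toy merged frame of three species (six rows, alternating before/after):

```r
# species_indicator: same number for each before/after pair of rows
# b_a_indicator:     1 = before optimization, 2 = after optimization
n_species <- 3
species_indicator <- as.vector(rbind(1:n_species, 1:n_species))
b_a_indicator     <- rep(c(1, 2), n_species)

species_indicator  # 1 1 2 2 3 3 -> one color per species
b_a_indicator      # 1 2 1 2 1 2 -> circle/x alternating per row
```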

Now we can compare our results before and after optimization using DanielPlot2

DanielPlot2(comparison_cropped_bg, title="Comparison cropped bg")

As we can see, using the different background has a dramatic influence on many species. Although some species mostly stay in the same spot (for instance Zizeeria karsandra), many others move a lot in the graph, for example Pontia glauconome. Differences are especially visible for species with a low sample size.

Use of landcover data

Next we can try to use an example of categorical data rather than continuous data like precipitation or temperature. We will choose landcover as a next parameter to test its influence on butterfly distribution. You can download landcover data from diva-gis.org.

land_cover <- raster("./parkistan_landcover/PAK_msk_cov.gri")
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in Proj4
## definition
plot(land_cover, ext=extent(parkistan))

The data originally comes from the GLC2000 (“Global Land Cover 2000”) dataset, and you can obtain the legend explaining the values from the GLC documentation:

The values range from 1 to 22, but this might not be a good scale for use in our model: the numeric values of the categories are very close to each other although the categories themselves are very different, so our model might not distinguish between them very well. That’s why we will reclassify the map. First, we can look at the distribution of classes using a histogram:

hist(land_cover, n=22)

So we have a lot of cultivated and managed areas (16) and also a lot of shrub cover. We will use reclassify(..., c(a,b,c)) to reclassify our map, where the value c is assigned to the range between a and b. Values 1 to 7 (different kinds of forests/trees) are reclassified to 100, because forest might be best for insects. Values 8-13 and 17-18 are reclassified to 70, because these are different kinds of shrub cover, which might also suit many species. Values 14-16 are assigned 40, because these are cultivated and managed areas, which might not be well suited but are still better than cities. Values 19-22 are assigned 0, because this is barren land or cities, which are not suited for butterflies. Before reclassifying, we also need to use resample() so we can later stack the landcover data with our t_mean data (otherwise we get an error message).

resampled <- resample(land_cover, climate$t_mean) #need resample to stack both layers. Otherwise -> "different extent"
land_cover_reclassified <- reclassify(resampled, c(1,7,100, 8,13,70, 14,16,40, 17,18,70, 19,22,0)) #resampled land_cover file
plot(land_cover_reclassified, ext=extent(parkistan), main="Land cover reclassified")
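The intended class-to-score mapping can also be written down as a plain lookup vector in base R (one score per GLC class value, independent of the raster package), which makes it easy to double-check individual classes:

```r
# One suitability score per land cover class value 1..22 (intended mapping)
suitability <- c(rep(100, 7),  # classes  1-7 : forests / trees
                 rep(70, 6),   # classes  8-13: shrub cover
                 rep(40, 3),   # classes 14-16: cultivated / managed areas
                 rep(70, 2),   # classes 17-18: more shrub cover
                 rep(0, 4))    # classes 19-22: barren land / cities

# Spot-check a few classes: forest, shrub, cultivated, shrub, city
suitability[c(3, 10, 15, 18, 21)]  # 100 70 40 70 0
```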

Now we can prepare a raster stack of our landcover data and our second environmental variable, t_mean, for use in MaxEnt. To extract environmental parameters from this raster stack, we also need the presence locations of our test species, which we can collect with a for loop.

tmean_landcover <- stack(climate$t_mean, land_cover_reclassified) #stack t_mean with the reclassified landcover

test_locations <- list() #Generate empty list
for(i in species_names_test){ #Write locations of the test dataset into the list; species_names_test = vector of species
  test_locations[[i]] <- locations[[i]]
}

lazy_MaxEnt() for very lazy modeling

Since all the data preparation before we can run MaxEnt is quite a lot of work, and we will do it again for more environmental variables, I decided to write another function, lazy_MaxEnt(), to make everything more convenient. This function only needs the background locations we sampled earlier using sampleRandom(), your environmental variables (as a raster or raster stack) and the presence locations of species records (as a list); the function does the rest. The if/else statement at the end of the code gives us the opportunity to specify MaxEnt features (args), which we will also test later. If no args are specified, MaxEnt runs with its default settings.

lazy_MaxEnt <- function(bg, env.var, locations, args){ #input has to be presence locations
  bg_df <- as.data.frame(bg)
  names(bg_df) <- c("lon", "lat")
  bg_env <- extract(env.var, bg_df)
  p_env <- lapply(X=lapply(locations, FUN=as.data.frame), FUN=extract, x=env.var)
  p_bg_env <- lapply(p_env, FUN=rbind, bg_env) #data has to be list of env presence data
  bg0 <- rep(0, nrow(as.data.frame(bg)))
  pa <- lapply(X=lapply(                     
                  FUN=rep, x=1, X=lapply(        
                  X=p_env, FUN=nrow)),  
                  FUN=c, bg0)
  p_bg_env_df <- lapply(p_bg_env, FUN=as.data.frame)
  
  if(missing(args)){
    me_result <- mapply(FUN=maxent, p_bg_env_df, p=pa)
  } else {
    me_result <- mapply(FUN=maxent, p_bg_env_df, p=pa, args=args)
  }
  return(me_result)
}

With this function we can do a very fast MaxEnt modeling in just one line of code:

maxent_t_mean_landcover <- lazy_MaxEnt(bg=bg_test, env.var=tmean_landcover, locations=test_locations)
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1
## Warning in matrix(as.numeric(d)): NAs introduced by coercion
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1 
## This is MaxEnt version 3.4.1

Using DanielExtract(), we can take a look at our results. Some species have 100% t_mean contribution while others have a large landcover contribution. We can also see that there is again a strange map with 0% of both.

results_t_mean_landcover <- DanielExtract(maxent_t_mean_landcover)
reactable(results_t_mean_landcover)

DanielMerge2 for merging two data frames with different environmental variables

Now we again want to merge the tables from before and after the optimization to compare them. Since DanielMerge() was designed to merge two data frames with identical columns, it cannot be applied here; otherwise we would have the precipitation contribution of our old data and the landcover contribution of our new data in the same column. So we will create DanielMerge2() for use on data with different variables. The difference is that it expands each data frame by one column so we can write the different variables of the two data frames into different columns.

DanielMerge2 <- function(df1, df2){
  # Create an empty data frame with the combined column names
  newdf <- data.frame(NA, NA, NA, NA, NA, NA, NA)
  colnames(newdf) <- c("species", "training_samples", "iterations", "training_AUC",
                       colnames(df1[5]), colnames(df2[5]), colnames(df2[6]))
  colnames(newdf) <- gsub(".contribution", "_contr.", colnames(newdf))
  colnames(newdf) <- gsub("PAK_msk_cov_contr.", "land_cover_contr.", colnames(newdf))

  # Sort both data frames by species so matching rows line up
  df1_ordered <- df1[order(df1$species), ]
  df2_ordered <- df2[order(df2$species), ]

  # Expand each data frame by one column so the variables end up in separate columns
  df1_expand <- mutate(df1_ordered, "New_col" = NA, .after = colnames(df1_ordered[5]))
  df2_expand <- mutate(df2_ordered, "New_col" = NA, .after = colnames(df2_ordered[4]))

  # Interleave the rows: odd rows from df1, even rows from df2
  for (i in seq_len(nrow(df1_expand))) {
    newdf[2 * i - 1, ] <- df1_expand[i, ]
    newdf[2 * i, ]     <- df2_expand[i, ]
  }
  return(newdf)
}
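The core of DanielMerge2() is the row interleaving. On two toy data frames (column names made up for illustration), the same pattern looks like this:

```r
# Two toy result tables with matching species order
df1 <- data.frame(species = c("A", "B"), value = c(1, 2))
df2 <- data.frame(species = c("A", "B"), value = c(3, 4))

out <- df1[0, ]                    # empty frame with the same columns
for (i in seq_len(nrow(df1))) {
  out[2 * i - 1, ] <- df1[i, ]     # odd rows come from the first frame
  out[2 * i, ]     <- df2[i, ]     # even rows come from the second frame
}
out$value
# [1] 1 3 2 4
```

Each species thus occupies two adjacent rows, one per input table, which is what makes the side-by-side comparison possible.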

The input of this function is the DanielExtract() data of our new MaxEnt model and of the old one we want to compare it with. Using it on the MaxEnt results of our cropped_bg (using t_mean and precipitation) and on our landcover MaxEnt model, we get the following table:

merge_t_mean_landcover <- DanielMerge2(results_cropped_bg, results_t_mean_landcover)
reactable(merge_t_mean_landcover)

So we can see that some values moved from one extreme to the other. For a better visual comparison, we can use DanielPlot2(). Note that the first values in our merged data frame (here: precipitation) are represented by circles and the second ones (here: landcover) by crosses.

DanielPlot2(merge_t_mean_landcover, title="Comparison landcover to precipitation")

Since extreme values like these do not seem very natural, this is probably not a very good result. One possible reason is an improper reclassification: the classes were chosen arbitrarily and might not represent the optimal conditions for butterflies. When we plot a zoomed-in version of our landcover map with the presence locations added, we can see that the classes are quite fine-grained and well-suited areas lie right next to less suitable ones. In several cases, butterflies are present in less suitable areas directly adjacent to green areas, which might cause problems when evaluating presence probabilities. So to really take landcover into account, a more realistic classification should be used, one that actually reflects habitat suitability instead of a rigid classification of vegetation cover.

# Bind all presence locations into one data frame (avoids the leading NA row
# that initializing with data.frame(lon=NA, lat=NA) would create)
test_locations_df <- do.call(rbind, lapply(test_locations, function(x) {
  setNames(as.data.frame(x), c("lon", "lat"))
}))

plot(land_cover_reclassified, ext=c(71,75,34,36))
points(test_locations_df, pch="+", cex=0.7)
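A suitability-oriented reclassification, as suggested above, would map each landcover class to a habitat suitability score rather than to a rigid vegetation class. The lookup itself is plain R; on a raster you would pass such a matrix to raster::reclassify(). All class codes and scores below are made up for illustration:

```r
# Hypothetical mapping: landcover class code -> habitat suitability score
rcl <- matrix(c(1, 0.9,   # e.g. grassland -> highly suitable (made-up value)
                2, 0.6,   # e.g. shrubland
                3, 0.2,   # e.g. forest
                4, 0.0),  # e.g. bare soil -> unsuitable
              ncol = 2, byrow = TRUE)

cells <- c(3, 1, 1, 4, 2)                      # toy landcover cell values
suitability <- rcl[match(cells, rcl[, 1]), 2]  # look up the score per cell
suitability
# [1] 0.2 0.9 0.9 0.0 0.6
```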

Influence of other environmental parameters

Before continuing, let’s take a look at the literature: a review by Dover & Settele (2009) addresses habitat suitability for butterflies. The take-home messages from this paper are:

  • Variability in rainfall can drive butterfly colonies to extinction
  • Butterflies are thermophilic
  • Temperature fluctuations can impact range contraction and expansion
  • Sward height (influenced by grazing) can have a dramatic influence; some species require short swards (L. bellargus) and some tall swards (T. acteon)
  • Some species can’t tolerate shade

So fluctuations of temperature and precipitation seem to play an important role, as does vegetation, which we have already checked. Next, we will screen several environmental variables from the Bioclim dataset. The following variables were chosen:

  • BIO2: Mean diurnal range
  • BIO3: Isothermality
  • BIO4: Temperature seasonality
  • BIO6: Min temperature of coldest month
  • BIO14: Precipitation of driest month
  • BIO15: Precipitation seasonality
  • BIO16: Precipitation of wettest quarter
  • BIO17: Precipitation of driest quarter
  • BIO18: Precipitation of warmest quarter

First, we need to prepare raster stacks of our environmental variables. For better comparison, we will use the mean temperature (layer 1 of Bioclim) together with one layer of a different variable in each case.

BIO <- list()
j=0
for(i in c(2,3,4,6,14,15,16,17,18)){
  j = j+1
  BIO[[j]] <- stack(bioclim_mosaic$layer.1, bioclim_mosaic[[i]])
}
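The manual counter j above can also be avoided with seq_along(). As a base-R illustration of the same index bookkeeping (the raster layers are replaced by plain index numbers here):

```r
layers <- c(2, 3, 4, 6, 14, 15, 16, 17, 18)  # Bioclim layers used above
BIO_idx <- list()
for (j in seq_along(layers)) {
  BIO_idx[[j]] <- c(1, layers[j])  # pair layer 1 (t_mean) with each variable
}
BIO_idx[[5]]
# [1]  1 14
```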

Next, we will crop the layers using lapply(). When renaming the layers, it is important to add a "_" before t_seasonality and t_min_coldest_month because otherwise the columns in our results table will be switched and you will get strange results when using DanielExtract(). At this point, note that these helper functions can save a lot of time and are very convenient for screening many combinations, but they are not professional and quite fragile. Thus, they should only be applied in very specific cases and treated with caution.

BIO_cropped <- lapply(BIO, FUN=mask, parkistan)
# The leading "_" in _t_seasonality and _t_min_coldest_month is important; otherwise
# the entries in the results table will be switched and the plot will look strange
BIO_names <- c("diurnal_range", "isothermality", "_t_seasonality", "_t_min_coldest_month", "prec.driest_month", "prec.seasonality", "prec.wettest_quarter", "prec.driest_quarter", "prec.warmest_quarter")

for(i in seq_along(BIO_names)){
  names(BIO_cropped[[i]]) <- c("t_mean", BIO_names[i])
}

Next, we can use lazy_MaxEnt() to directly use MaxEnt on our list of raster stacks:

maxent_BIO <- list()
for(i in (1:length(BIO_cropped))){
  maxent_BIO[[i]] <- lazy_MaxEnt(bg= bg_test, env.var = BIO_cropped[[i]], locations = test_locations)
}
## This is MaxEnt version 3.4.1 
## Warning in matrix(as.numeric(d)): NAs introduced by coercion
## [version message repeated for every species and raster stack; further identical warnings omitted]

Quick visualization of results

For even more convenience, we can write the function A_very_lazy_plot(), which generates a comparison plot directly from our MaxEnt results without additional work. The input is a list of MaxEnt results and another list to compare it to:

A_very_lazy_plot <- function(maxent_new, maxent_to_compare, title){
  new <- DanielExtract(maxent_new)
  compare <- DanielExtract(maxent_to_compare)
  merge <- DanielMerge2(compare, new)
  plot <- DanielPlot2(merge, title = title)
  return(plot)
}

Using this function on our Bioclim results maxent_BIO, with maxent_cropped_bg as comparison, we obtain a list of plots that compare the result for each variable with the original result of t_mean and precipitation.

BIO_plots <- list()
for(i in (1:length(maxent_BIO))){
BIO_plots[[i]] <- A_very_lazy_plot(maxent_BIO[[i]], maxent_cropped_bg, title=BIO_names[i])
}

The object BIO_plots is a list of several plots. We could look at each element of this list one by one, or we could make comparison even more convenient by adding a slider that controls which plot is shown. For the slider we create a list of steps; each step has a "visible" argument that is TRUE for exactly one plot and FALSE for all others, so only one plot is displayed at a time. After creating the steps, we simply add the slider to subplot(BIO_plots), which combines all plots into a single one. Note that you first have to move the slider to see an individual plot; otherwise all plots are stacked on top of each other.

#Generate steps for slider
steps <- list()
for(i in 1:length(BIO_plots)){
  step <- list(args = list('visible', rep(FALSE, length(BIO_plots))),
               label = BIO_names[i], method = 'restyle')
  step$args[[2]][i] <- TRUE   # only plot i remains visible for this step
  steps[[i]] <- step
}
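Because a mistake in the visibility vectors would silently show the wrong plot, a quick base-R sanity check that every step enables exactly one plot can help (rebuilt here on a small made-up label set):

```r
labels <- c("diurnal_range", "isothermality", "_t_seasonality")  # subset for illustration
steps_check <- lapply(seq_along(labels), function(i) {
  vis <- rep(FALSE, length(labels))
  vis[i] <- TRUE                    # only plot i is visible for this step
  list(args = list("visible", vis), label = labels[i], method = "restyle")
})
sapply(steps_check, function(s) sum(s$args[[2]]))  # one TRUE per step
# [1] 1 1 1
```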

#Add slider to plots (subplot alone will merge all plots together)
slider_plot <- subplot(BIO_plots) %>%
  layout(sliders = list(list(active = 1,
                             currentvalue = list(prefix = "Environmental variables: "),
                             steps = steps)))

slider_plot
## Warning: 'layout' objects don't have these attributes: 'NA'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

From this plot we can see the following:

  • Diurnal range: t_mean contribution shifted towards the middle, AUC stayed mostly the same, the strange spot stays strange, less extreme values of t_mean contrib.
  • Isothermality: basically the same
  • t_seasonality: some spots moved from medium to extreme values of t_mean contrib., others from extreme to medium, some even from 100% to 0% -> does not look stable
  • t_min coldest month: same as t_seasonality
  • prec. driest month: the most extreme values didn’t change much, one value shifted to an unrealistic AUC, everything looks a bit worse
  • prec. seasonality: most spots shift to higher (extreme) values of t_mean contrib. -> does not look good
  • prec. wettest quarter: most spots did not change much, but fewer values at 100% t_mean contrib. -> looks a little better than using only precipitation
  • prec. driest quarter: extreme values didn’t change, medium values of t_mean contrib. decreased
  • prec. warmest quarter: same as prec. driest quarter
  • The strange spot (Tarucus rosaceus) stays strange under every condition except t_min_coldest_month

To visually assess how well an environmental variable might correlate with species distribution, we can also plot the maps:

plot(climate$t_mean, ext=extent(parkistan), main="t_mean")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

plot(climate$precip, ext=extent(parkistan), main="precip")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

plot(BIO_cropped[[1]]$diurnal_range, ext=extent(parkistan), main="Diurnal range")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

plot(BIO_cropped[[2]]$isothermality, ext=extent(parkistan), main="isothermality")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

plot(BIO_cropped[[8]]$prec.driest_quarter, ext=extent(parkistan), main="Precip. of driest quarter")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

plot(land_cover_reclassified, ext=extent(parkistan), pch="+", cex=0.5, main="Landcover reclassified")
lapply(X=lapply(X=test_locations, FUN=SpatialPoints), FUN=points, pch="+", cex=0.5)

From these plots, diurnal range seems to be quite good, while precipitation and precipitation of the driest quarter look a little bit random.

Influence of MaxEnt features

Now that we have looked at the influence of different environmental parameters, we can also investigate the different MaxEnt features. As mentioned in the theory part, there are linear (l), quadratic (q), threshold (t), hinge (h) and product (p) features that determine how our environmental parameters are processed. By default, all features are enabled. We can assess their influence by disabling a few of them and checking whether the result changes. Since lazy_MaxEnt() also accepts additional arguments, all we need is a list of the argument sets we want to test.

features <- list(c("quadratic=false"),
                 c("product=false"),
                 c("hinge=false"),
                 c("threshold=false"),
                 c("quadratic=false", "product=false"),
                 c("quadratic=false", "hinge=false"),
                 c("product=false", "hinge=false"),
                 c("threshold=false", "hinge=false"),
                 c("threshold=false", "hinge=false", "product=false"))
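The feature sets above were written by hand; if you want to screen all combinations of a given size instead, base R’s combn() can generate them (a sketch, using the same flag strings):

```r
feats <- c("quadratic=false", "product=false", "hinge=false", "threshold=false")
pairs <- combn(feats, 2, simplify = FALSE)  # all two-feature combinations
length(pairs)
# [1] 6
pairs[[1]]
# [1] "quadratic=false" "product=false"
```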

maxent_features <- list()
for(i in seq_along(features)){
maxent_features[[i]] <- lazy_MaxEnt(bg_test, BIO_cropped[[1]], test_locations, features[[i]])
}
## This is MaxEnt version 3.4.1 
## Warning in matrix(as.numeric(d)): NAs introduced by coercion
## [version message repeated for every species and feature set; further identical warnings omitted]

Now we can run A_very_lazy_plot() on our results to directly generate a comparison plot.

feature_plots <- list()
for(i in (1:length(maxent_features))){
feature_plots[[i]] <- A_very_lazy_plot(maxent_features[[i]], maxent_BIO[[1]], title="Feature Screening")
}

#Generate steps for slider
steps_2 <- list()
for(i in 1:length(feature_plots)){
  step_2 <- list(args = list('visible', rep(FALSE, length(feature_plots))),
                 label = paste(features[[i]], collapse = " "),  # collapse avoids "NA" labels for shorter feature sets
                 method = 'restyle')
  step_2$args[[2]][i] <- TRUE
  steps_2[[i]] <- step_2
}

#Add slider to plots (subplot alone will merge all plots together)
feature_plot <- subplot(feature_plots) %>%
  layout(sliders = list(list(active = 1,
                             currentvalue = list(prefix = "Features: "),
                             steps = steps_2)))

feature_plot

In this plot we can see that disabling some features does not lead to major changes in variable contribution, the number of iterations or training AUC. There are some small changes with the product and hinge features. Where changes are present, they are highly dependent on the species, so there is no overall trend. However, the fact that these metrics barely change does not necessarily mean that the predictions are equal, so we should plot some of the predictions. Here I chose hinge=false (maxent_features[[3]]) and threshold=false (maxent_features[[4]]) as well as the default settings for comparison.

predict_hinge_f <- lapply(maxent_features[[3]], FUN=predict, BIO_cropped[[1]])
predict_threshold_f <- lapply(maxent_features[[4]], FUN=predict, BIO_cropped[[1]])
predict_default <- lapply(maxent_BIO[[1]], FUN=predict, BIO_cropped[[1]])

Plotting predictions of the features screening

When plotting the three predictions, we can see that there are indeed major differences between them. In most cases, the default settings seem to produce the most accurate result. Interestingly, with hinge=false or threshold=false, the species Tarucus rosaceus no longer produces a strange map as it did with the default settings, whereas under those conditions Pseudochazara lehana produces a strange map that had been normal with the default settings.

par(mfrow=c(2,3))
for(i in seq_along(predict_default)){
  plot(predict_default[[i]], ext=extent(parkistan), main=paste(test_names[[i]], "default"))
  points(SpatialPoints(test_locations[[i]]), pch="+", cex=0.5)
  plot(predict_hinge_f[[i]], ext=extent(parkistan), main=paste(test_names[[i]], "h = F"))
  points(SpatialPoints(test_locations[[i]]), pch="+", cex=0.5)
  plot(predict_threshold_f[[i]], ext=extent(parkistan), main=paste(test_names[[i]], "t = F"))
  points(SpatialPoints(test_locations[[i]]), pch="+", cex=0.5)
}

Conclusion

The most promising variable in this test seemed to be diurnal range. However, it was only tested in combination with t_mean, and since this pairs two temperature variables, another combination might be better suited. Disabling some features changed the predictions in some cases, but in most cases the default settings seemed to be the best choice. This was, however, highly dependent on the species; while the generation of strange maps can be avoided by disabling features, the resulting predictions were still not necessarily good.

With lazy_MaxEnt() and A_very_lazy_plot(), screening further environmental variables is very convenient and does not take much time, so you can test more combinations if you are interested.

Literature

Dover, J. & J. Settele (2009): The influences of landscape structure on butterfly distribution and movement: a review. Journal of Insect Conservation 13: 3–27.

Fletcher, R. & M.-J. Fortin (2019): Spatial Ecology and Conservation Modeling: Applications with R. Springer.

Merow, C., Smith, M. J. & J. A. Silander (2013): A practical guide to MaxEnt for modeling species’ distributions: what it does, and why inputs and settings matter. Ecography 36: 1058–1069.

Morales, N. S., Fernández, I. C. & V. Baca-González (2017): MaxEnt’s parameter configuration and small samples: are we paying attention to recommendations? A systematic review. PeerJ 5: e3093.

Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E. & C. J. Yates (2011): A statistical explanation of MaxEnt for ecologists. Diversity and Distributions 17: 43–57.

Phillips, S. J., Anderson, R. P. & R. E. Schapire (2006): Maximum entropy modeling of species geographic distributions. Ecological Modelling 190: 231–259.

Guillera-Arroita, G., Lahoz-Monfort, J. & J. Elith (2014): Maxent is not a presence–absence method: a comment on Thibaud et al. Methods Ecol. Evol. 5: 1192–1197.

Session info

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmdformats_1.0.2 rgeos_0.5-5      reactable_0.2.3  plotly_4.9.3    
## [5] dismo_1.3-3      dplyr_1.0.5      ggplot2_3.3.3    raster_3.4-5    
## [9] sp_1.4-5        
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.22         bslib_0.2.4       purrr_0.3.4      
##  [5] rJava_1.0-4       lattice_0.20-41   colorspace_2.0-0  vctrs_0.3.8      
##  [9] generics_0.1.0    htmltools_0.5.1.1 viridisLite_0.4.0 yaml_2.2.1       
## [13] utf8_1.2.1        rlang_0.4.10      jquerylib_0.1.4   pillar_1.6.0     
## [17] reactR_0.4.4      glue_1.4.2        withr_2.4.2       DBI_1.1.1        
## [21] lifecycle_1.0.0   stringr_1.4.0     munsell_0.5.0     gtable_0.3.0     
## [25] htmlwidgets_1.5.3 codetools_0.2-18  evaluate_0.14     knitr_1.31       
## [29] crosstalk_1.1.1   fansi_0.4.2       highr_0.8         Rcpp_1.0.6       
## [33] scales_1.1.1      jsonlite_1.7.2    digest_0.6.27     stringi_1.5.3    
## [37] bookdown_0.22     grid_4.0.5        rgdal_1.5-23      tools_4.0.5      
## [41] magrittr_2.0.1    sass_0.3.1        lazyeval_0.2.2    tibble_3.1.1     
## [45] crayon_1.4.1      tidyr_1.1.3       pkgconfig_2.0.3   ellipsis_0.3.2   
## [49] data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.7     httr_1.4.2       
## [53] R6_2.5.0          compiler_4.0.5