A generalized linear model (GLM) is an extension of the linear regression model. Linear models have a dependent variable Y witch is equal to a point on the linear fuktionY = ax + b.
In GLMs the predictor variables X, in our case biofactors like precipitation, are joined to create a linear predictor. This linear predictor is connected with the response variable Y via a link funktion. Therefor the distribution of the response variable does not nessarily have to be Gaussian. It can also be e.g. Poisson or binomial.
Load packages:
library("raster")
library("rgdal")
library("maptools")
library("geonames")
library("dismo")
Clear workspace
rm(list=ls())
Set a seed to ensure reproducibility
set.seed(0)
Read in the Shapefile “distribution_merged_Pakistan”, These file shows the records of the individual species of butterflys in Pakistan.
butterflys <- readOGR("E:/Uni/5. Semester/Pakistan_Projekt/bsc-sdm-2019/data/distribution/distribution_merged_Pakistan")
## OGR data source with driver: ESRI Shapefile
## Source: "E:\Uni\5. Semester\Pakistan_Projekt\bsc-sdm-2019\data\distribution\distribution_merged_Pakistan", layer: "distribution_Pakistan_all_test"
## with 5870 features
## It has 3 fields
pakistan <- getData("GADM", country="Pakistan", level=0) #Load the boundaries of pakistan
plot(pakistan) #Plot the boundaries of pakistan
plot(butterflys, add = TRUE, col='blue') #We add all recorded butterflys to our plot
Now we can check the data. We see, that some of the points are outside pakistan. The majority lies within the boundarys.
Set working directory
wd <- "E:/Uni/5. Semester/Pakistan_Projekt/GLM"
setwd(wd)
This code extracts the coordinates of the records out of the shapefile and write them into a new dataframe called“butterf”.
head(butterflys) #Shows the head of the butterfly dataframe
## species subspecies id
## 1 Actinor_radians <NA> 0
## 2 Actinor_radians <NA> 0
## 3 Actinor_radians <NA> 0
## 4 Aeromachus_stigmatus <NA> 0
## 5 Aeromachus_stigmatus <NA> 0
## 6 Aeromachus_stigmatus <NA> 0
but_cor<-coordinates(butterflys) #Seperates just the coordinates
butterf<-as.data.frame(but_cor)
colnames(butterf) = c('x','y') #Changes the names of the two columns
For this example we need the records of one speciese.g.: Parnara bada..
Therefor we seperate the coordinates for just one species:
Parnara_bada <- butterf[111:145,] # extracts the coordinates of *Parnara bada*
Source: https://en.wikipedia.org/wiki/Parnara_bada#/media/File:Parnara_bada_sida.jpg. Date: 07.02.2020
Load Raster Data and stack the three Layers, plots the layers with the extent. These raster layer will be our predictors for the GLM.
At first we load the Chelsa data. You can get different biovariables on the website http://chelsa-climate.org/downloads/
We have downloaded three layers:
Chelsa_bio10_10 = warmest quarter mean Temperature global
Chelsa_bio10_12 = annual global precipitation
Chelsa_bio10_18 = precipitation of the warmest quarter
files<-list.files("E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Predictors",pattern = ".tif", full.names = TRUE) # This line creates a vector with all tif-files in this location
chelsa <- raster::stack(files)
chelsa
## class : RasterStack
## dimensions : 20880, 43200, 902016000, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : -180.0001, 179.9999, -90.00014, 83.99986 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : CHELSA_bio10_01, CHELSA_bio10_12, CHELSA_bio10_18
## min values : -32768, -32768, -32768
## max values : 32767, 32767, 32767
nlayers(chelsa) # shows the number of layers in the raster stack
## [1] 3
crs(chelsa) # shows the coordinate referent system of the layer
## CRS arguments:
## +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
Loads the three Worldclim-Layers, Source: http://worldclim.org/version2
wc_tavg<-raster("E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_avg/tavg_mean.tif",pattern = ".tif", full.names = TRUE)
wc_tavg
## class : RasterLayer
## dimensions : 21600, 43200, 933120000 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_avg/tavg_mean.tif
## names : tavg_mean
wc_tmax<-raster("E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_max/tmax_mean.tif",pattern = ".tif", full.names = TRUE)
wc_tmax
## class : RasterLayer
## dimensions : 21600, 43200, 933120000 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_max/tmax_mean.tif
## names : tmax_mean
wc_tmin<-raster("E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_min/tmin_mean.tif",pattern = ".tif", full.names = TRUE)
wc_tmin
## class : RasterLayer
## dimensions : 21600, 43200, 933120000 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : E:/Uni/5. Semester/Pakistan_Projekt/Daten/Raster/Worldclim/worldclim_temp_min/tmin_mean.tif
## names : tmin_mean
Stack the worldclim layers together.
wc_layers <- c(wc_tavg,wc_tmax,wc_tmin)
wc<-stack(wc_layers)
Creates an extent with the extent. We need an extent to ensure, that every layer has the same extent.
ext <- extent(c(60, 78, 23, 38))
ext
## class : Extent
## xmin : 60
## xmax : 78
## ymin : 23
## ymax : 38
With the comand “crop” you cut out a raster layer with a polygon, in this case with the extent.
chelsa_ext<-crop(chelsa,ext)
wc_ext<-crop(wc,ext)
chelsa_ext
## class : RasterBrick
## dimensions : 1800, 2160, 3888000, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 59.99986, 77.99986, 22.99986, 37.99986 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : C:/Users/myDickshardt/AppData/Local/Temp/RtmpgLvQ3T/raster/r_tmp_2020-02-11_220702_12048_82359.grd
## names : CHELSA_bio10_01, CHELSA_bio10_12, CHELSA_bio10_18
## min values : -232, 35, 0
## max values : 299, 3406, 3221
wc_ext
## class : RasterBrick
## dimensions : 1800, 2160, 3888000, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 60, 78, 23, 38 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : C:/Users/myDickshardt/AppData/Local/Temp/RtmpgLvQ3T/raster/r_tmp_2020-02-11_220707_12048_28966.grd
## names : tavg_mean, tmax_mean, tmin_mean
## min values : -33.85000, -32.90833, -34.80000
## max values : 28.82500, 38.23333, 22.16667
Now we can stack the layers together. Make sure that every layer has the same extent, otherwise the command “stack” won`t work.
b <- c(chelsa_ext,wc_ext)
predictors <- stack(b)
We can check the data by plotting a map with the first Layer of the predictors and all records. The blue points are all butterfly recorts. The red point are recorts of Parnara bada.
data(wrld_simpl)
plot(predictors,1, main = "Check the data")
plot(wrld_simpl, add=TRUE, )
points(butterf, col='blue')
points(Parnara_bada, col="red", pch=18)
We need some background data for the absent data, This funktion creates random Points in the observation area.
backgr <- randomPoints(predictors, nrow(Parnara_bada)) #Creates as much points as we have present points in Parnara_bada
For a generalized linear model we need present and absent data. In the vektor “presvals” are the values of the predictor layers at the lokation of the records. In “absvals” are the values at the lokation of the randam points.
presvals <- extract(predictors, Parnara_bada)
absvals <- extract(predictors, backgr)
We create a vektor (pb) witch writes for absent data 0 and for present data 1. Afterwards we build a dataframe with this vector and the present and absent data
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
The command kfold gives every recort in the dataframe a number between 1 and k (in this case 5). We need this operation to get train and test data to test the model later on.
group <- kfold(Parnara_bada, 5)
pres_train <- Parnara_bada[group != 1, ]
pres_test <- Parnara_bada[group == 1, ]
Now we do the same for the background data.
backg <- randomPoints(predictors, nrow(Parnara_bada), ext=ext, extf = 1.05) # With extf the extent increases
colnames(backg) = c('x','y')
group <- kfold(backg, 5)
backg_train <- backg[group != 1, ]
backg_test <- backg[group == 1, ]
We create a matrix out of the present and background training data
train <- rbind(pres_train, backg_train)
We build a dataframe with the training data and the predictor layers
pb_train <- c(rep(1, nrow(pres_train)), rep(0, nrow(backg_train)))
envtrain <- extract(predictors, train)
envtrain <- data.frame( cbind(pa=pb_train, envtrain) )
head(envtrain) #shows the head of the dataframe
## pa CHELSA_bio10_01 CHELSA_bio10_12 CHELSA_bio10_18 tavg_mean tmax_mean
## 1 1 215 963 134 21.94167 30.10833
## 2 1 166 1406 735 17.58333 24.53333
## 3 1 134 1829 735 15.93333 22.05000
## 4 1 143 1933 900 16.49167 22.16667
## 5 1 243 518 82 23.82500 31.46667
## 6 1 233 870 128 23.37500 31.46667
## tmin_mean
## 1 13.75000
## 2 10.65833
## 3 9.80000
## 4 10.81667
## 5 16.15833
## 6 15.34167
The same for the test data. You need train and test-data if you want to do cross-validation later on.
testpres <- data.frame( extract(predictors, pres_test) )
testbackg <- data.frame( extract(predictors, backg_test) )
We have everything to fit the generalized linear model. The first statement in the glm-funktion is the formula. In our case we have the depending variabel “pa”, witch can have the value 0 or 1. “1” stands for the presence and “0” for the absence of an individual of Parnara bada. The secont part of the formula are the values of the enviroment-layers witch are within the envtrain data. The Point (.) you put every value into the model.
The family statment is for choosing a distribution out of the exponantial family. When you have binary data, you should use binomial in the family-statement with the logit link funktion.
m1 <- glm(pa ~ .,family = binomial(link = "logit"), data=envtrain)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m1
##
## Call: glm(formula = pa ~ ., family = binomial(link = "logit"), data = envtrain)
##
## Coefficients:
## (Intercept) CHELSA_bio10_01 CHELSA_bio10_12 CHELSA_bio10_18
## -1.863e+00 -7.483e-02 4.232e-03 2.815e-02
## tavg_mean tmax_mean tmin_mean
## -1.431e+02 7.169e+01 7.213e+01
##
## Degrees of Freedom: 55 Total (i.e. Null); 49 Residual
## Null Deviance: 77.63
## Residual Deviance: 23.64 AIC: 37.64
summary(m1)
##
## Call:
## glm(formula = pa ~ ., family = binomial(link = "logit"), data = envtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.54002 -0.31510 -0.01282 0.00125 2.65694
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.863e+00 5.869e+00 -0.317 0.7509
## CHELSA_bio10_01 -7.483e-02 4.693e-02 -1.594 0.1108
## CHELSA_bio10_12 4.232e-03 2.461e-03 1.719 0.0855 .
## CHELSA_bio10_18 2.815e-02 1.449e-02 1.942 0.0521 .
## tavg_mean -1.431e+02 7.233e+01 -1.978 0.0479 *
## tmax_mean 7.169e+01 3.631e+01 1.974 0.0484 *
## tmin_mean 7.213e+01 3.623e+01 1.991 0.0465 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 77.632 on 55 degrees of freedom
## Residual deviance: 23.642 on 49 degrees of freedom
## AIC: 37.642
##
## Number of Fisher Scoring iterations: 9
The residual deviance gives us the quality of the fit of the model. The higher the number, the worse the fit.
The coefficients of worldclim are significant, while the coefficients of Chelsa aren`t significant. On the output we see the AIC (Akaike Information Criterion). With this value you are able to compare models with each other. It is recommended to choose the model with the smallerst AIC (Wollschläger 2017).
prediction <- predict(predictors,m1, ext=ext)
plot(prediction, main = "Prediction for Parnara bada")
plot(wrld_simpl, add=TRUE)
points(Parnara_bada, col="red", pch=18)
For the assessment of the occurrence of a species it is necessary to create a threshold. In this particular case we use the upper quantile as the threshold. Liu et.al (2005) shows several other options.
We want to have a map witch shows the occurrence as a “1” and the absence of Parnara bada as a “0”. For this reason we need to reclassify the prediction. Every value witch is under the threshold becomes a “0”, the ones over the threshold a “1”.
threshold_glm <- raster::quantile(prediction)[4]
classification_matrix <- matrix(c(minValue(prediction),threshold_glm , 0, # This function creates a matrix for the reclassify-funktion
threshold_glm, maxValue(prediction), 1),
ncol=3, byrow = TRUE)
result_presence_absence <- reclassify(prediction, rcl = classification_matrix) #Reclassification for the prediction
presence_absence_map <- result_presence_absence
plot(presence_absence_map, main = "Presence/absence of Parnara bada")
plot(wrld_simpl, add=TRUE)
points(Parnara_bada, col="red", pch=18)
The map showsthe distribution area of Parnara bada..
In the green area the conditions for Parnara bada are given. The red pionts are the records of the species.
The last step is to write out and save the map. Therefor you can use the writeRaster argument. Make sure, that you set the path correctly.
writeRaster(presence_absence_map, filename = file.path("E:/Uni/5. Semester/Pakistan_Projekt/GLM/", "Distribution_output"), format="GTiff", overwrite = TRUE)
Baltes-Goetz, B. (2016): Generalisierte lineare Modelle und GEE-Modesse in SPSS Statistics. https://www.uni-trier.de/fileadmin/urt/doku/gzlm_gee/gzlm_gee.pdf Date: 20.01.2020. Trier.
Hijmans, R.H., Elith, J. (2019): Spatial Distribution Models. https://rspatial.org/sdm/SDM.pdf Date: 10.Jan.2020.
Hijmans, R.H. (2019): Point pattern analysis. https://rspatial.org/raster/analysis/8-pointpat.html Date: 08.Jan.2020.
Liu, C., Berry, P. M., Dawson, T.P., Pearson, R. G. (2005): Selecting thresholds of occurrence in the prediction of species distributions. https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.0906-7590.2005.03957.x Date: 29.Jan.2020.
Tshikolovents, V. (2016): The Butterflies of Pakistan. Kiew
Wollschlaeger, D. (2017): Grundlagen der Datenanalyse mit R: Eine anwendungsorientierte Einführung. 4. Auflage. Berlin.