eXtrem Gradient Boosting (XGBoost)
Tree boosting is a widely used machine learning method and XGBoost use far fewer resources than existing systems. It`s usage went from advertising systems learn to match from users and their interests up to detecting anomaly events in physics that lead to new pyhsics (Chen, T., Guestrin, C. 2016, P. 1).
Example for Decision Tree:
Note: Score in model is not the Gini impurity!
Gradient Boost
Creates a single leaf which represents an initial guess for a value of all of the samples. Out of the leaf it´s create a tree and build a next tree based on the errors made by the previous tree. Until it failed to fit or it created the trees you set at the beginning.
Statquest Gradient Boost Part 1 (of 4): Regression Main Ideas
XGBoost
Does an initial prediction regardless of regression or classification. Use unique regression trees starting with one leaf. Leaf contains a similarity score of (lamdba). This scores used for calculating the gain. Higher gain = better split in a tree. Gamma can be used to prune Trees.
Lambda is a regularization parameter this results into a less higher sensitivity to individual observations. It also prevent the training data to over fitting.
Next tree will be created by learning rate “eta” (default setting is 0.3).
Statquest XGBoost Part 1 (of 4): Regression
[XGBoost read the docs]: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
Data Preparation
#Loading Tools
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.0.5
library(raster)
## Loading required package: sp
#Check Workspace
getwd()
## [1] "C:/Users/Jan-Niklas Tripp/Desktop/SDM/XGBoost_Tutorial"
#Set Workspace into created Folder
setwd("C:/Users/Jan-Niklas Tripp/Desktop/SDM/XGBoost_Tutorial")
##Data Preparation
#Loading the CSV Data
species_locations = read.csv("PakistanLadakh.csv", sep = ",")
#Check for Colnames
head(species_locations)
## ï..species x y
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6 Acraea_violae 72.34951 31.70947
#Change Colnames for readability
colnames(species_locations) = cbind("species", "long", "lat")
#Check for NA Values
summary(species_locations)
## species long lat
## Length:5870 Min. :62.23 Min. :24.13
## Class :character 1st Qu.:71.90 1st Qu.:33.79
## Mode :character Median :73.20 Median :34.74
## Mean :72.71 Mean :34.27
## 3rd Qu.:73.93 3rd Qu.:35.94
## Max. :77.78 Max. :37.12
#New Colnames check
head(species_locations)
## species long lat
## 1 Acraea_issoria 73.08635 34.43060
## 2 Acraea_issoria 73.36326 34.24367
## 3 Acraea_issoria 73.36758 34.14221
## 4 Acraea_issoria 73.82637 33.84266
## 5 Acraea_issoria 74.19839 33.94852
## 6 Acraea_violae 72.34951 31.70947
##A raster background map showing elevation data
#get ISO code for Pakistan
getData("ISO3")
## ISO3 NAME
## 1 AFG Afghanistan
## 2 XAD Akrotiri and Dhekelia
## 3 ALA Åland
## 4 ALB Albania
## 5 DZA Algeria
## 6 ASM American Samoa
## 7 AND Andorra
## 8 AGO Angola
## 9 AIA Anguilla
## 10 ATA Antarctica
## 11 ATG Antigua and Barbuda
## 12 ARG Argentina
## 13 ARM Armenia
## 14 ABW Aruba
## 15 AUS Australia
## 16 AUT Austria
## 17 AZE Azerbaijan
## 18 BHS Bahamas
## 19 BHR Bahrain
## 20 BGD Bangladesh
## 21 BRB Barbados
## 22 BLR Belarus
## 23 BEL Belgium
## 24 BLZ Belize
## 25 BEN Benin
## 26 BMU Bermuda
## 27 BTN Bhutan
## 28 BOL Bolivia
## 29 BES Bonaire, Saint Eustatius and Saba
## 30 BIH Bosnia and Herzegovina
## 31 BWA Botswana
## 32 BVT Bouvet Island
## 33 BRA Brazil
## 34 IOT British Indian Ocean Territory
## 35 VGB British Virgin Islands
## 36 BRN Brunei
## 37 BGR Bulgaria
## 38 BFA Burkina Faso
## 39 BDI Burundi
## 40 KHM Cambodia
## 41 CMR Cameroon
## 42 CAN Canada
## 43 CPV Cape Verde
## 44 XCA Caspian Sea
## 45 CYM Cayman Islands
## 46 CAF Central African Republic
## 47 TCD Chad
## 48 CHL Chile
## 49 CHN China
## 50 CXR Christmas Island
## 51 XCL Clipperton Island
## 52 CCK Cocos Islands
## 53 COL Colombia
## 54 COM Comoros
## 55 COK Cook Islands
## 56 CRI Costa Rica
## 57 CIV Côte d'Ivoire
## 58 HRV Croatia
## 59 CUB Cuba
## 60 CUW Curaçao
## 61 CYP Cyprus
## 62 CZE Czech Republic
## 63 COD Democratic Republic of the Congo
## 64 DNK Denmark
## 65 DJI Djibouti
## 66 DOM Dominican Republic
## 67 DMA Dominica
## 68 ECU Ecuador
## 69 EGY Egypt
## 70 SLV El Salvador
## 71 GNQ Equatorial Guinea
## 72 ERI Eritrea
## 73 EST Estonia
## 74 ETH Ethiopia
## 75 FLK Falkland Islands
## 76 FRO Faroe Islands
## 77 FJI Fiji
## 78 FIN Finland
## 79 FRA France
## 80 GUF French Guiana
## 81 PYF French Polynesia
## 82 ATF French Southern Territories
## 83 GAB Gabon
## 84 GMB Gambia
## 85 GEO Georgia
## 86 DEU Germany
## 87 GHA Ghana
## 88 GIB Gibraltar
## 89 GRC Greece
## 90 GRL Greenland
## 91 GRD Grenada
## 92 GLP Guadeloupe
## 93 GUM Guam
## 94 GTM Guatemala
## 95 GGY Guernsey
## 96 GNB Guinea-Bissau
## 97 GIN Guinea
## 98 GUY Guyana
## 99 HTI Haiti
## 100 HMD Heard Island and McDonald Islands
## 101 HND Honduras
## 102 HKG Hong Kong
## 103 HUN Hungary
## 104 ISL Iceland
## 105 IND India
## 106 IDN Indonesia
## 107 IRN Iran
## 108 IRQ Iraq
## 109 IRL Ireland
## 110 IMN Isle of Man
## 111 ISR Israel
## 112 ITA Italy
## 113 JAM Jamaica
## 114 JPN Japan
## 115 JEY Jersey
## 116 JOR Jordan
## 117 KAZ Kazakhstan
## 118 KEN Kenya
## 119 KIR Kiribati
## 120 XKO Kosovo
## 121 KWT Kuwait
## 122 KGZ Kyrgyzstan
## 123 LAO Laos
## 124 LVA Latvia
## 125 LBN Lebanon
## 126 LSO Lesotho
## 127 LBR Liberia
## 128 LBY Libya
## 129 LIE Liechtenstein
## 130 LTU Lithuania
## 131 LUX Luxembourg
## 132 MAC Macao
## 133 MKD Macedonia
## 134 MDG Madagascar
## 135 MWI Malawi
## 136 MYS Malaysia
## 137 MDV Maldives
## 138 MLI Mali
## 139 MLT Malta
## 140 MHL Marshall Islands
## 141 MTQ Martinique
## 142 MRT Mauritania
## 143 MUS Mauritius
## 144 MYT Mayotte
## 145 MEX Mexico
## 146 FSM Micronesia
## 147 MDA Moldova
## 148 MCO Monaco
## 149 MNG Mongolia
## 150 MNE Montenegro
## 151 MSR Montserrat
## 152 MAR Morocco
## 153 MOZ Mozambique
## 154 MMR Myanmar
## 155 NAM Namibia
## 156 NRU Nauru
## 157 NPL Nepal
## 158 NLD Netherlands
## 159 NCL New Caledonia
## 160 NZL New Zealand
## 161 NIC Nicaragua
## 162 NER Niger
## 163 NGA Nigeria
## 164 NIU Niue
## 165 NFK Norfolk Island
## 166 PRK North Korea
## 167 XNC Northern Cyprus
## 168 MNP Northern Mariana Islands
## 169 NOR Norway
## 170 OMN Oman
## 171 PAK Pakistan
## 172 PLW Palau
## 173 PSE Palestina
## 174 PAN Panama
## 175 PNG Papua New Guinea
## 176 XPI Paracel Islands
## 177 PRY Paraguay
## 178 PER Peru
## 179 PHL Philippines
## 180 PCN Pitcairn Islands
## 181 POL Poland
## 182 PRT Portugal
## 183 PRI Puerto Rico
## 184 QAT Qatar
## 185 COG Republic of Congo
## 186 REU Reunion
## 187 ROU Romania
## 188 RUS Russia
## 189 RWA Rwanda
## 190 BLM Saint-Barthélemy
## 191 MAF Saint-Martin
## 192 SHN Saint Helena
## 193 KNA Saint Kitts and Nevis
## 194 LCA Saint Lucia
## 195 SPM Saint Pierre and Miquelon
## 196 VCT Saint Vincent and the Grenadines
## 197 WSM Samoa
## 198 SMR San Marino
## 199 STP Sao Tome and Principe
## 200 SAU Saudi Arabia
## 201 SEN Senegal
## 202 SRB Serbia
## 203 SYC Seychelles
## 204 SLE Sierra Leone
## 205 SGP Singapore
## 206 SXM Sint Maarten
## 207 SVK Slovakia
## 208 SVN Slovenia
## 209 SLB Solomon Islands
## 210 SOM Somalia
## 211 ZAF South Africa
## 212 SGS South Georgia and the South Sandwich Islands
## 213 KOR South Korea
## 214 SSD South Sudan
## 215 ESP Spain
## 216 XSP Spratly Islands
## 217 LKA Sri Lanka
## 218 SDN Sudan
## 219 SUR Suriname
## 220 SJM Svalbard and Jan Mayen
## 221 SWZ Swaziland
## 222 SWE Sweden
## 223 CHE Switzerland
## 224 SYR Syria
## 225 TWN Taiwan
## 226 TJK Tajikistan
## 227 TZA Tanzania
## 228 THA Thailand
## 229 TLS East Timor
## 230 TGO Togo
## 231 TKL Tokelau
## 232 TON Tonga
## 233 TTO Trinidad and Tobago
## 234 TUN Tunisia
## 235 TUR Turkey
## 236 TKM Turkmenistan
## 237 TCA Turks and Caicos Islands
## 238 TUV Tuvalu
## 239 UGA Uganda
## 240 UKR Ukraine
## 241 ARE United Arab Emirates
## 242 GBR United Kingdom
## 243 UMI United States Minor Outlying Islands
## 244 USA United States
## 245 URY Uruguay
## 246 UZB Uzbekistan
## 247 VUT Vanuatu
## 248 VAT Vatican City
## 249 VEN Venezuela
## 250 VNM Vietnam
## 251 VIR Virgin Islands, U.S.
## 252 WLF Wallis and Futuna
## 253 ESH Western Sahara
## 254 YEM Yemen
## 255 ZMB Zambia
## 256 ZWE Zimbabwe
#ISO code = PAK
elevation_PAK = getData("alt", country="PAK", level=0)
#Boundary of Pakistan for
boundary_PAK = getData("GADM", country="PAK", level=0)
show(boundary_PAK)
## class : SpatialPolygonsDataFrame
## features : 1
## extent : 60.89944, 77.84308, 23.70292, 37.09701 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## variables : 2
## names : GID_0, NAME_0
## value : PAK, Pakistan
#loading Bioclim Data Bio 2 around Pakistan
bioclim_data = getData(name = "worldclim",
var = "bio",
res = 2.5,
lon = 23.70292,
lat = 77.84308)[[c(2)]]
#Elevation Map
plot(elevation_PAK, main = "Elevation of Pakistan", col = terrain.colors(255))
#Checking if data are out of Pakistan
points(x = species_locations$long,
y = species_locations$lat,
col = "olivedrab",
pch = 15,
cex = 0.50)
#Convert to SP-DataFrame
coordinates(species_locations) = ~long+lat
#Setting CRS to Elevation Map
crs(species_locations) = crs(boundary_PAK)
#Crop species to Pakistan
crop_species_locations = crop(x = species_locations, y = boundary_PAK)
#Plot for checking
plot(elevation_PAK, main = "Elevation of Pakistan", col = terrain.colors(255))
plot(crop_species_locations, add = TRUE)
#loading Bioclim Data Bio 2
bioclim_data = getData(name = "worldclim",
var = "bio",
res = 5)[[c(2)]]
plot(bioclim_data, main = "Worldclim map of layer Bio_2")
#Shows extent of Pakistan
print(extent(boundary_PAK))
## class : Extent
## xmin : 60.89944
## xmax : 77.84308
## ymin : 23.70292
## ymax : 37.09701
#Create an extent Object which contains the max and min Values for Longitude and Latitude
extent_pak = extent(boundary_PAK)
#Use the extent via Crop Function on bioclim Data
bioclim_data = crop(bioclim_data, extent_pak) #bioclim_data = mask(bioclim_data, boundary_PAK)
#Test Plot for visual check
plot(bioclim_data, main = "Worldclim map of layer Bio_2")
plot(crop_species_locations, add = TRUE)
Example of XGBoost
#Matrix of bioclim_data
bio_values = extract(bioclim_data, extent_pak)
head(bio_values)
## [1] 140 140 140 140 139 140
#Matrix of Values where our species are located
species_bio = extract(bioclim_data, crop_species_locations)
head(species_bio)
## [1] 121 90 89 99 88 148
#Our train dataset.
train_bio = data.matrix(species_bio)
head(train_bio)
## [,1]
## [1,] 121
## [2,] 90
## [3,] 89
## [4,] 99
## [5,] 88
## [6,] 148
#Remember we need to transform our species (Char) into numeric values
#For this we have to convert the char in a factor and than into a number otherwise it contains only NA
classes = as.numeric(as.factor(crop_species_locations$species)) -1
head(classes)
## [1] 0 0 0 0 0 1
#A very simple xgb model
xgb <- xgboost(booster="gbtree",
data = train_bio,
label = classes,
eta = 0.3,
max_depth = 6,
nround=10,
objective = "multi:softprob",
num_class = length(unique(classes)),#Number of unique Classes = 421
nthread = 3)
## [09:59:29] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
## [1] train-mlogloss:5.661157
## [2] train-mlogloss:5.437970
## [3] train-mlogloss:5.276015
## [4] train-mlogloss:5.150365
## [5] train-mlogloss:5.049788
## [6] train-mlogloss:4.967163
## [7] train-mlogloss:4.898900
## [8] train-mlogloss:4.841171
## [9] train-mlogloss:4.791974
## [10] train-mlogloss:4.749990
#Use the Model on the Matrix of bioclim_data
result = predict(xgb, as.matrix(bio_values))
#Creating a raster with bioclim_data
res = raster(bioclim_data)
#Giving the raster the Values of Species 1 and their probability to occur in one Cell of the raster
res = setValues(res, result[1:32683]+1)
#Plotting the first Species
plot(res, main = "Probability of species 1 to occur")
XGBoost with Train and Test data
#Set.seed ensures that we get the same sample
set.seed(1)
#Train index contain 25% random numbers in range of the length from species_bio
train_index = sample(length(species_bio), floor(0.25*length(species_bio)))
head(train_index)
## [1] 1017 4775 2177 5026 1533 4567
#Train data contain 25% of the data of species_bio
train_data = species_bio[train_index]
head(train_index)
## [1] 1017 4775 2177 5026 1533 4567
#Test data contain 75% of the data of species_bio
test_data = species_bio[-train_index]
head(test_data)
## [1] 121 90 89 99 88 148
#Container for species classes
train_label = as.numeric(as.factor(classes[train_index]))-1
test_label = as.numeric(as.factor(classes[-train_index]))-1
#Container for species bio clim values
train_data2 = as.matrix(train_data)
test_data2 = as.matrix(test_data)
#Create an xgb model with train_data2
xgb_test <- xgboost(booster="gbtree",
data = train_data2,
label = train_label,
eta = 0.9,
max_depth = 8,
nround=50,
objective = "multi:softmax",#softprob doesn´t work with this set of data.
num_class = length(unique(train_label)),
nthread = 3)
## [09:59:36] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softmax' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
## [1] train-mlogloss:5.003776
## [2] train-mlogloss:4.597611
## [3] train-mlogloss:4.347867
## [4] train-mlogloss:4.176858
## [5] train-mlogloss:4.055059
## [6] train-mlogloss:3.957138
## [7] train-mlogloss:3.877603
## [8] train-mlogloss:3.808987
## [9] train-mlogloss:3.749453
## [10] train-mlogloss:3.699752
## [11] train-mlogloss:3.655303
## [12] train-mlogloss:3.618612
## [13] train-mlogloss:3.589193
## [14] train-mlogloss:3.563043
## [15] train-mlogloss:3.540428
## [16] train-mlogloss:3.521586
## [17] train-mlogloss:3.504595
## [18] train-mlogloss:3.488876
## [19] train-mlogloss:3.475848
## [20] train-mlogloss:3.463068
## [21] train-mlogloss:3.451809
## [22] train-mlogloss:3.441724
## [23] train-mlogloss:3.432001
## [24] train-mlogloss:3.423303
## [25] train-mlogloss:3.415419
## [26] train-mlogloss:3.408140
## [27] train-mlogloss:3.400987
## [28] train-mlogloss:3.394957
## [29] train-mlogloss:3.388676
## [30] train-mlogloss:3.383113
## [31] train-mlogloss:3.377686
## [32] train-mlogloss:3.372787
## [33] train-mlogloss:3.367757
## [34] train-mlogloss:3.363231
## [35] train-mlogloss:3.358610
## [36] train-mlogloss:3.354526
## [37] train-mlogloss:3.350906
## [38] train-mlogloss:3.347500
## [39] train-mlogloss:3.343986
## [40] train-mlogloss:3.340747
## [41] train-mlogloss:3.337713
## [42] train-mlogloss:3.334894
## [43] train-mlogloss:3.331864
## [44] train-mlogloss:3.329363
## [45] train-mlogloss:3.327100
## [46] train-mlogloss:3.324417
## [47] train-mlogloss:3.321903
## [48] train-mlogloss:3.319762
## [49] train-mlogloss:3.317321
## [50] train-mlogloss:3.314564
#Predict with the model classes in test_data
result2 = predict(xgb_test, test_data2)
head(test_label)
## [1] 0 0 0 0 0 1
head(result2)
## [1] 132 235 41 3 204 111
Just Predict 2 Species
#New container for species_locations
species1 = species_locations
#Transform species in species1 into numbers
species1$species = as.numeric(as.factor(species1$species))
#I choose the species 6 because it has 53 entries and added another one to the set
species1 = species1[species1$species==6 | species1$species==5,]
head(species1)
## species
## 16 5
## 17 5
## 18 5
## 19 5
## 20 5
## 21 5
species1_bio = extract(bioclim_data, species1)
#Set class value of species 5 to 0
species1[species1$species==5,] = 0
head(species1)
## species
## 16 0
## 17 0
## 18 0
## 19 0
## 20 0
## 21 0
set.seed(2)
#Train index contain 25% random numbers in range of the length from species1
train_index2 = sample(length(species1), floor(0.25*length(species1)))
head(train_index2)
## [1] 21 15 6 60 32 8
#Train data contain 25% of the data of species1_bio
train_data3 = species1_bio[train_index2]
head(train_index2)
## [1] 21 15 6 60 32 8
#Test data contain 75% of the data of species1_bio
test_data3 = species1_bio[-train_index2]
head(test_data2)
## [,1]
## [1,] 121
## [2,] 90
## [3,] 89
## [4,] 99
## [5,] 88
## [6,] 148
#Container for species classes
train_label2 = as.numeric(as.factor(species1$species[train_index2]))-1
test_label2 = as.numeric(as.factor(species1$species[-train_index2]))-1
#Container for species bio clim values
train_data4 = as.matrix(train_data3)
test_data4 = as.matrix(test_data3)
#Create an xgb model with train_data2
xgb_test <- xgboost(booster="gbtree",
data = train_data4,
label = train_label2,
eta = 0.1,
max_depth = 6,
nround=10,
objective = "binary:hinge",
num_class = length(unique(train_label2))-1,
nthread = 4)
## [1] train-error:0.266667
## [2] train-error:0.266667
## [3] train-error:0.266667
## [4] train-error:0.266667
## [5] train-error:0.266667
## [6] train-error:0.266667
## [7] train-error:0.000000
## [8] train-error:0.000000
## [9] train-error:0.000000
## [10] train-error:0.000000
#Predict with the model classes in test_data
result3 = predict(xgb_test, test_data4)
head(test_label2)
## [1] 0 0 0 0 0 1
head(result3)
## [1] 0 0 0 1 0 0
#Wow it looks perfect! Too perfect...
#Lets do a prediction with all data
result_perf = predict(xgb_test, as.matrix(bio_values))
#Check what it predict for every Cell
summary(result_perf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.9144 1.0000 1.0000
res2 = raster(bioclim_data)
res2 = setValues(res2, result_perf)
plot(res2, main = "Prediction if species 5 or 6 occur")
plot(boundary_PAK, add = TRUE)
Sources: StatQuest with Josh Starmer (22.01.2018): StatQuest: Decision Trees. https://www.youtube.com/watch?v=7VeUPuFGJHk (Zugriff: 28.06.2021).
StatQuest with Josh Starmer (16.12.2019): XGBoost Part 1 (of 4): Regression. https://www.youtube.com/watch?v=OtD8wVaFm6E&t=213s (Zugriff: 28.06.2021).
StatQuest with Josh Starmer (13.01.2020): XGBoost Part 2 (of 4): Classification. https://www.youtube.com/watch?v=8b1JEDvenQU&t=923s (Zugriff: 28.06.2021).
StatQuest with Josh Starmer (25.03.2019): Gradient Boost Part 1 (of 4): Regression Main Ideas. https://www.youtube.com/watch?v=3CC4N4z3GJc (Zugriff:28.06.2021).
Chen, T., Guestrin, C. (2016): XGBoost: A Scalable Tree Boosting System. https://arxiv.org/pdf/1603.02754.pdf (Zugriff: 15.06.2021).