Part 3: Modelling with R
In this third part of the studio, we’ll use R to model Aedes albopictus distribution in Northern Italy. For that, we need to connect to GRASS via the rgrass
package in order to read occurrence data and predictors. The rgrass package is developed by Bivand (2022) and can be found at: https://github.com/rsbivand/rgrass/. See the vignette with further explanations and examples at: https://rsbivand.github.io/rgrass/.
rgrass
- initGRASS(): starts a GRASS GIS session from R
- execGRASS(): executes GRASS GIS commands
- gmeta(): shows GRASS location metadata
- read_VECT() and read_RAST(): read vector and raster maps from GRASS into R terra objects
- write_VECT() and write_RAST(): write R terra objects into the GRASS GIS database
Usage
GRASS GIS and R can be used together in two ways:
A. Using R within a GRASS GIS session, i.e., starting R (or RStudio) from the GRASS GIS terminal:
- type R or rstudio & in the GRASS GIS terminal
- load the rgrass library
- use read_VECT() and read_RAST() to read data from GRASS into R
- access GRASS GIS modules and the database through execGRASS()
- write data (back) to the GRASS database with write_VECT() and write_RAST()

B. Using GRASS GIS within an R session, i.e., connecting to the GRASS GIS database from within R (or RStudio):
- start GRASS GIS with initGRASS() from R
- access GRASS GIS modules through execGRASS()
- use read_VECT(), read_RAST(), write_VECT() and write_RAST() to read data from and write data to the GRASS database
SDM workflow
In this part of the Studio, we’ll cover the middle and right side of the SDM workflow: modeling and predictions.
There are several packages to carry out SDM; in this case we’ll use SDMtune by Vignali et al. (2020). It provides functions covering the whole SDM workflow, from data preparation to variable selection, optimization and evaluation. Have a look at the articles on the package website for further details: https://consbiol-unibern.github.io/SDMtune/index.html.
Let’s move to R
Load packages needed
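The package-loading chunk is not shown in this extract; a minimal sketch, with the library list inferred from the functions called later in this section (your setup may include others):

# Packages used in this section (list inferred, not the original chunk)
library(rgrass)   # GRASS GIS - R interface
library(terra)    # raster and vector data handling
library(sf)       # vector data as simple features
library(mapview)  # interactive map viewing
library(SDMtune)  # SDM workflow: SWD objects, training, evaluation
library(zeallot)  # %<-% multi-assignment operator used below
library(ggpubr)   # ggarrange() for response curve panels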
Initialize GRASS
We’ll use option B, i.e., we’ll launch GRASS GIS in a defined location and mapset from within R.
# path to GRASS binaries (run `grass --config path`)
grassbin <- system("grass --config path", intern = TRUE)
# path to GRASS database
grassdata <- path.expand("~/grass_ncsu_2023/grassdata/")
# path to location
location <- "eu_laea"
# path to mapset
mapset <- "italy_LST_daily"

# start GRASS GIS from R
initGRASS(gisBase = grassbin,
          home = tempdir(),
          gisDbase = grassdata,
          location = location,
          mapset = mapset,
          override = TRUE,
          remove_GISRC = TRUE)
## gisdbase /home/veroandreo/grass_ncsu_2023/grassdata/
## location eu_laea
## mapset italy_LST_daily
## rows 438
## columns 582
## north 2666000
## south 2228000
## west 4053000
## east 4635000
## nsres 1000
## ewres 1000
## projection:
## PROJCRS["ETRS89-extended / LAEA Europe",
## BASEGEOGCRS["ETRS89",
## ENSEMBLE["European Terrestrial Reference System 1989 ensemble",
## MEMBER["European Terrestrial Reference Frame 1989"],
## MEMBER["European Terrestrial Reference Frame 1990"],
## MEMBER["European Terrestrial Reference Frame 1991"],
## MEMBER["European Terrestrial Reference Frame 1992"],
## MEMBER["European Terrestrial Reference Frame 1993"],
## MEMBER["European Terrestrial Reference Frame 1994"],
## MEMBER["European Terrestrial Reference Frame 1996"],
## MEMBER["European Terrestrial Reference Frame 1997"],
## MEMBER["European Terrestrial Reference Frame 2000"],
## MEMBER["European Terrestrial Reference Frame 2005"],
## MEMBER["European Terrestrial Reference Frame 2014"],
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[0.1]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4258]],
## CONVERSION["Europe Equal Area 2001",
## METHOD["Lambert Azimuthal Equal Area",
## ID["EPSG",9820]],
## PARAMETER["Latitude of natural origin",52,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",10,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["False easting",4321000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",3210000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["northing (Y)",north,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["easting (X)",east,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Statistical analysis."],
## AREA["Europe - European Union (EU) countries and candidates. Europe - onshore and offshore: Albania; Andorra; Austria; Belgium; Bosnia and Herzegovina; Bulgaria; Croatia; Cyprus; Czechia; Denmark; Estonia; Faroe Islands; Finland; France; Germany; Gibraltar; Greece; Hungary; Iceland; Ireland; Italy; Kosovo; Latvia; Liechtenstein; Lithuania; Luxembourg; Malta; Monaco; Montenegro; Netherlands; North Macedonia; Norway including Svalbard and Jan Mayen; Poland; Portugal including Madeira and Azores; Romania; San Marino; Serbia; Slovakia; Slovenia; Spain including Canary Islands; Sweden; Switzerland; Turkey; United Kingdom (UK) including Channel Islands and Isle of Man; Vatican City State."],
## BBOX[24.6,-35.58,84.73,44.83]],
## ID["EPSG",3035]]
Read vector data
Now we read in the occurrence data and the background points hosted in GRASS, convert them to sf objects and display them with mapview.
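The reading chunk itself is not shown in this extract; a minimal sketch, where the vector map names aedes_albopictus and background_points are assumptions to be replaced by the names in your mapset:

# Hypothetical map names; adjust to your GRASS mapset
presence <- st_as_sf(read_VECT("aedes_albopictus"))
background <- st_as_sf(read_VECT("background_points"))

# Interactive display of both layers
mapview(presence, col.regions = "red") +
  mapview(background, col.regions = "gray", cex = 2)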
Read raster data
We now read all the variables we derived from the daily LST time series.
# List rasters by pattern
worldclim <- execGRASS("g.list",
                       parameters = list(type = "raster",
                                         pattern = "worldclim*"))
avg <- execGRASS("g.list",
                 parameters = list(type = "raster",
                                   pattern = "avg*"))
median <- execGRASS("g.list",
                    parameters = list(type = "raster",
                                      pattern = "median*",
                                      exclude = "*[1-5]"))

# Concatenate map lists
to_import <- c(attributes(worldclim)$resOut,
               attributes(avg)$resOut,
               attributes(median)$resOut)

# Read raster layers
predictors <- list()
for (i in to_import) {
  predictors[[i]] <- read_RAST(i)
}

# Stack rasters
predictors_r <- rast(predictors)
Let’s visualize the imported maps. Note that we convert the terra object into a raster object because mapview does not support terra yet.
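The visualization chunk is omitted in this extract; a minimal sketch using one of the imported layers (worldclim_bio01) as an example:

# Coerce one SpatRaster layer to a RasterLayer for mapview
# (recent mapview versions may handle terra objects directly)
library(raster)
mapview(raster(predictors_r[["worldclim_bio01"]]))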
Data preparation
Now that we have imported presence records, background points and predictor variables derived from the LST time series, we need to prepare the data in a format called samples with data (SWD). This is basically a table with presence and background coordinates plus the corresponding values of the predictor variables.
# Variables for models
sp <- "Aedes albopictus"
presence_coords <- st_coordinates(presence)
background <- st_coordinates(background)
env <- predictors_r

# Prepare data: SWD
data_sp <- prepareSWD(species = sp,
                      p = presence_coords,
                      a = background,
                      env = env)
data_sp
## Object of class SWD
##
## Species: Aedes albopictus
## Presence locations: 83
## Absence locations: 1000
##
## Variables:
## ---------
## Continuous: worldclim_bio01 worldclim_bio02 worldclim_bio03 worldclim_bio04 worldclim_bio05 worldclim_bio06 worldclim_bio07 worldclim_bio10 worldclim_bio11 avg_autumnal_cooling avg_count_tmean_higher20_lower30 avg_spring_warming median_lower_m2_consec_days
## Categorical: NA
Define relevant variables
We define here some of the input values required throughout the workflow:

seed <- 49          # random seed for reproducibility
perc_test <- 0.2    # proportion of presence records held out for testing
k <- 4              # number of cross-validation folds
method <- "Maxent"  # modeling method
cor_th <- 0.7       # correlation threshold for variable selection
perm <- 10          # number of permutations for importance estimation
imp_th <- 10        # variable importance threshold (%)
Create train and test datasets
We will train the model with 80% of the presence samples and leave the remaining 20% for evaluation at the end.
# Create training and test sets
c(train_sp, test_sp) %<-% trainValTest(data_sp,
                                       test = perc_test,
                                       only_presence = TRUE,
                                       seed = seed)
train_sp
## Object of class SWD
##
## Species: Aedes albopictus
## Presence locations: 66
## Absence locations: 1000
##
## Variables:
## ---------
## Continuous: worldclim_bio01 worldclim_bio02 worldclim_bio03 worldclim_bio04 worldclim_bio05 worldclim_bio06 worldclim_bio07 worldclim_bio10 worldclim_bio11 avg_autumnal_cooling avg_count_tmean_higher20_lower30 avg_spring_warming median_lower_m2_consec_days
## Categorical: NA
test_sp
## Object of class SWD
##
## Species: Aedes albopictus
## Presence locations: 17
## Absence locations: 1000
##
## Variables:
## ---------
## Continuous: worldclim_bio01 worldclim_bio02 worldclim_bio03 worldclim_bio04 worldclim_bio05 worldclim_bio06 worldclim_bio07 worldclim_bio10 worldclim_bio11 avg_autumnal_cooling avg_count_tmean_higher20_lower30 avg_spring_warming median_lower_m2_consec_days
## Categorical: NA
Create folds for cross-validation
As we will use cross-validation during the model training, we create the folds in advance. In this case we use random folds, but other methods exist. Since we are limited by the number of presence records, we will create only 4 folds. The algorithm will iteratively use 3 folds to train and 1 to validate.
# Create folds
ran_folds <- randomFolds(train_sp,
                         k = k,
                         only_presence = TRUE,
                         seed = seed)
Train a default Maxent model with CV
We will first train a so-called full model, i.e., a model with all predictors, and from there we’ll remove those that are highly correlated or of low importance.
# Train a full model
full_model_sp <- train(method = method,
                       data = train_sp,
                       folds = ran_folds)
full_model_sp
## Object of class SDMmodelCV
## Method: Maxent
##
## Species: Aedes albopictus
## Replicates: 4
## Presence locations: 66
## Absence locations: 1000
##
## Model configurations:
## --------------------
## fc: lqph
## reg: 1
## iter: 500
##
## Variables:
## ---------
## Continuous: worldclim_bio01 worldclim_bio02 worldclim_bio03 worldclim_bio04 worldclim_bio05 worldclim_bio06 worldclim_bio07 worldclim_bio10 worldclim_bio11 avg_autumnal_cooling avg_count_tmean_higher20_lower30 avg_spring_warming median_lower_m2_consec_days
## Categorical: NA
Let’s see the predictions of the full model
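The prediction chunk is omitted in this extract; a minimal sketch, where map_full_sp is a hypothetical name. For an SDMmodelCV object, predict() averages the predictions of the k replicate models by default:

# Predict habitat suitability with the cross-validated full model
map_full_sp <- predict(full_model_sp,
                       data = predictors_r,
                       type = "cloglog")
plot(map_full_sp)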
Remove less important variables
First we discard highly correlated variables (see the sketch below); then we also remove variables with a percent contribution or permutation importance lower than 10%, again checking that the AUC does not decrease.
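The chunk removing correlated variables, which produces the vs_sp object used below, is not shown in this extract; a sketch with SDMtune’s varSel(), where the exact argument choices are assumptions:

# Remove highly correlated variables (threshold cor_th defined above);
# vs_sp is the reduced cross-validated model that feeds into reduceVar()
vs_sp <- varSel(full_model_sp,
                metric = "auc",
                bg4cor = data_sp,
                cor_th = cor_th,
                permut = perm,
                interactive = FALSE)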
# Remove less important variables only if AUC does not decrease
reduc_var_sp <- reduceVar(vs_sp,
                          th = imp_th,
                          metric = "auc",
                          test = TRUE,
                          permut = perm,
                          use_jk = TRUE,
                          interactive = FALSE)
Let’s explore the result
reduc_var_sp
## Object of class SDMmodelCV
## Method: Maxent
##
## Species: Aedes albopictus
## Replicates: 4
## Presence locations: 66
## Absence locations: 1000
##
## Model configurations:
## --------------------
## fc: lqph
## reg: 1
## iter: 500
##
## Variables:
## ---------
## Continuous: worldclim_bio04 worldclim_bio07 avg_count_tmean_higher20_lower30
## Categorical: NA
We now need to recreate the SWD object and the train/test datasets with the selected variables only, in order to run the final model and make predictions.
# Get only relevant variables from the reduced model
retained_varnames <- names(reduc_var_sp@models[[1]]@data@data)

# Subset the predictor stack
env <- terra::subset(env, retained_varnames)

# SWD with the selected variables
subset_train_sp <- prepareSWD(species = sp,
                              p = presence_coords,
                              a = background,
                              env = env)

c(train_sp, test_sp) %<-% trainValTest(subset_train_sp,
                                       test = perc_test,
                                       only_presence = TRUE,
                                       seed = seed)
Run the best model and make predictions
Now we train the final model with the full training set; we no longer need the folds at this point. Note that we also use the feature classes (fc) and regularization (reg) from the best model obtained before. In this case they are default values only, but if we had also done hyper-parameter optimization, they might differ.
final_model_sp <- train(method = method,
                        data = train_sp,
                        fc = reduc_var_sp@models[[1]]@model@fc,
                        reg = reduc_var_sp@models[[1]]@model@reg)
Let’s make predictions now and explore the result
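The prediction chunk is omitted in this extract; a minimal sketch producing the map_sp_maxent object that is written back to GRASS below:

# Predict with the final model on the reduced predictor stack (cloglog scale)
map_sp_maxent <- predict(final_model_sp,
                         data = env,
                         type = "cloglog")
plot(map_sp_maxent)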
Write result back to GRASS
We can now write the raster with the final model’s predictions into the GRASS database.
write_RAST(map_sp_maxent,
           "Aedes_albopictus_maxent",
           flags = c("o", "overwrite"))
## Over-riding projection check
## Importing raster map <Aedes_albopictus_maxent>...
Check that the map is there
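A quick check from R, listing rasters in the mapset by pattern (the pattern here is just an example matching the name used above):

# Confirm the prediction map now exists in the GRASS database
execGRASS("g.list", parameters = list(type = "raster",
                                      pattern = "Aedes*"))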
Model evaluation
We want to know how good our model is, so in this step we use the test dataset that we set aside at the beginning. An AUC of 0.5 would mean the model performs no better than flipping a coin. AUC is what we call a threshold-independent evaluation metric.
# AUC
auc_maxent <- auc(final_model_sp, test = test_sp)
auc_maxent
## [1] 0.8666471
Usually, however, the result of an SDM is converted into a presence/absence map. To determine which threshold to use, we perform threshold-dependent evaluations.
# Threshold-dependent evaluation
th_maxent <- thresholds(final_model_sp,
                        type = "cloglog",
                        test = test_sp)

knitr::kable(th_maxent, format = 'html', digits = 2)
| Threshold | Cloglog value | Fractional predicted area | Training omission rate | Test omission rate | P-values |
|---|---|---|---|---|---|
| Minimum training presence | 0.04 | 0.95 | 0.00 | 0 | 0 |
| Equal training sensitivity and specificity | 0.50 | 0.28 | 0.27 | 0 | 0 |
| Maximum training sensitivity plus specificity | 0.51 | 0.24 | 0.30 | 0 | 0 |
| Equal test sensitivity and specificity | 0.53 | 0.18 | 0.47 | 0 | 0 |
| Maximum test sensitivity plus specificity | 0.53 | 0.17 | 0.47 | 0 | 0 |
Let’s choose one threshold and create a binary map
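A minimal sketch, using the "Maximum training sensitivity plus specificity" cloglog value from the table above; the threshold choice depends on the application:

# Binarize the suitability map with the chosen cloglog threshold (0.51)
pa_map <- map_sp_maxent >= 0.51
plot(pa_map)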
Variable importance
Variable importance indicates how much each predictor contributes to the model’s predictions.
vi_model_sp <- maxentVarImp(final_model_sp)
vi_model_sp
##                           Variable Percent_contribution Permutation_importance
## 1 avg_count_tmean_higher20_lower30              73.4133                69.7786
## 2                  worldclim_bio07              21.2599                12.7664
## 3                  worldclim_bio04               5.3268                17.4550
plotVarImp(vi_model_sp)
Response curves
Response curves give us an idea of the relationship between predictor variables and probability of occurrence.
# Plot the response curve of each retained variable
my_rp <- function(i) {
  plotResponse(reduc_var_sp, i)
}

plotlist <- lapply(retained_varnames, my_rp)
labels <- LETTERS[1:length(retained_varnames)]
ggpubr::ggarrange(plotlist = plotlist, labels = labels)
We close the mapset and we are done.
# close the mapset
unlink_.gislock()
Disclaimer
This is only a simple example of an SDM workflow, and only the beginning… There are: