Tuesday 23 February 2016

Efficient spatial joining for large datasets in R


I am working with rather large data frames that I often need to do a spatial join on. The fastest way I have come up with so far is this method:



library(rgdal) # also attaches sp, which provides SpatialPoints() and over()

download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# Read ecoregion shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land", verbose = FALSE)

# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
c <- expand.grid(lon = lon, lat = lat)

# Get the ecoregion for each position
pings <- SpatialPoints(c[c('lon', 'lat')], proj4string = ices_eco@proj4string)
c$area <- over(pings, ices_eco)$Ecoregion

But this takes a very long time and uses a lot of RAM, and sometimes fails with "Error: cannot allocate vector of size 460 Kb" (if you can't reproduce the error, just make c larger).


Can anyone come up with a better/faster/more efficient solution?



Answer




First of all, making a function more efficient can mean speeding up how quickly the computer undertakes that action (algorithmic efficiency). And...



Efficient R programming is the implementation of efficient programming practices in R. All languages are different, so efficient R code does not look like efficient code in another language. Many packages have been optimised for performance so, for some operations, achieving maximum computational efficiency may simply be a case of selecting the appropriate package and using it correctly. There are many ways to get the same result in R, and some are very slow. Therefore not writing slow code should be prioritised over writing fast code. (Taken from the book Efficient R Programming.)



So, performing more efficient spatial joins for large datasets could simply mean writing faster code. In this case, I assume that the over function from the sp package has already been optimised for performance, so I won't write another function myself or look for another R package.


Instead, I will show how to go parallel in R and speed up the process by using all the CPU cores in your computer. Please try the commented, reproducible code example below:


Load data


# Load libraries
library(rgdal)
library(sp) # attached by rgdal anyway, but loaded explicitly for clarity

# Load data
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# Read ecoregion shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land",
                           verbose = FALSE)

# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
df <- expand.grid(lon = lon, lat = lat)
df$area <- NA # Add an empty attribute "area" to hold the ecoregions later

# Create a SpatialPointsDataFrame object from the data.frame
coordinates(df) <- c("lon", "lat")

# Assign the same projection as the ecoregions to the SpatialPointsDataFrame
proj4string(df) <- ices_eco@proj4string
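
As an optional sanity check, you can verify that both objects now share the same CRS, since over() expects matching coordinate reference systems:

# Optional sanity check: over() expects both objects in the same CRS
stopifnot(identical(proj4string(df), proj4string(ices_eco)))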


Aim: get the Ecoregion from the ices_eco SpatialPolygonsDataFrame object for each position in the df SpatialPointsDataFrame object.


# Parallel processing: using multiple CPU cores

# Load the 'parallel' package, which supports parallel computation in R
library('parallel')

# Calculate the number of cores (leave one core free for your operating system)
no_cores <- detectCores() - 1

# Cut df into "n" parts
# Higher "n": less memory required but slower
# Lower "n": more memory required but faster
n <- 1000
parts <- split(x = 1:length(df), f = cut(1:length(df), n))
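
As a quick illustrative check, you can confirm that the n parts jointly cover every row of df exactly once and in order:

# Illustrative check: the parts cover 1:length(df) exactly once, in order
length(parts)      # number of chunks (n)
length(parts[[1]]) # roughly length(df) / n points per chunk
stopifnot(identical(unlist(parts, use.names = FALSE), 1:length(df)))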

Initiate the cluster like this if you are on macOS or Linux:




# Initiate a cluster of CPU cores
# Note: a FORK cluster copies the parent workspace, so all the objects used
# in the parallel process (e.g. ices_eco, df, n, parts) must be defined
# before the cluster is made
cl <- makeCluster(no_cores, type = "FORK")

print(cl) # summary of the cluster

Initiate the cluster like this if you are on Windows:




# Initiate a cluster of CPU cores
# Note: a PSOCK cluster starts with empty workers, so all the objects used
# in the parallel process (e.g. ices_eco, df, n, parts) must be defined
# before the cluster is made and then exported to the workers
cl <- makeCluster(no_cores, type = "PSOCK")

# Load the required libraries on the cluster workers
clusterEvalQ(cl = cl, expr = c(library('sp')))

# Export to the workers all the objects required to run the function
clusterExport(cl = cl, varlist = c("ices_eco", "df", "parts", "n"))

print(cl) # summary of the cluster


Continue running the parallel function


# Parallelize the sp::over function
# Returns a list with the overlays
system.time(
  overParts <- parLapply(cl = cl, X = 1:n, fun = function(x) {
    ov <- over(df[parts[[x]], ], ices_eco) # overlay one chunk of points
    gc() # release memory on the worker
    return(ov)
  })
)

# user system elapsed
# 1.050 1.150 627.111

# Stop the cluster of CPU cores
stopCluster(cl)

# Merge df with the ecoregions
for (i in 1:n) {
  message(paste("Merging part", i, "of", n))
  df$area[parts[[i]]] <- as.character(overParts[[i]]$Ecoregion)
}
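
Because the parts cover the rows of df in order (see the check above), the merge loop can also be replaced by a single vectorized assignment; a minimal sketch, assuming that ordering holds:

# Vectorized alternative to the merge loop (assumes parts are in row order)
df$area <- unlist(lapply(overParts, function(p) as.character(p$Ecoregion)),
                  use.names = FALSE)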

Control check


# Control check by randomly sampling 20 elements
randomSampling <- sample(x = 1:length(df), size = 20)

# Direct method (no parallelization)
chkA <- as.character(over(df[randomSampling, ], ices_eco,
                          returnList = FALSE)$Ecoregion)
# Values taken from the parallel result
chkB <- df$area[randomSampling]

# chkA should be equal to chkB
print(cbind(chkA, chkB))

#       chkA                         chkB
# [1,]  NA                           NA
# [2,]  "Baltic Sea"                 "Baltic Sea"
# [3,]  NA                           NA
# [4,]  "Baltic Sea"                 "Baltic Sea"
# [5,]  "Baltic Sea"                 "Baltic Sea"
# [6,]  NA                           NA
# [7,]  NA                           NA
# [8,]  "Celtic Seas"                "Celtic Seas"
# [9,]  NA                           NA
# [10,] NA                           NA
# [11,] "Baltic Sea"                 "Baltic Sea"
# [12,] "Oceanic Northeast Atlantic" "Oceanic Northeast Atlantic"
# [13,] "Greater North Sea"          "Greater North Sea"
# [14,] NA                           NA
# [15,] NA                           NA
# [16,] "Faroes"                     "Faroes"
# [17,] NA                           NA
# [18,] NA                           NA
# [19,] NA                           NA
# [20,] "Baltic Sea"                 "Baltic Sea"
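
Instead of eyeballing the table, a programmatic comparison along these lines also works (counting two NAs as a match):

# Programmatic check: every sampled pair must match (NA matched with NA)
stopifnot(all(chkA == chkB | (is.na(chkA) & is.na(chkB))))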

Note: if you have access to more than one computer, you can use a cluster of machines and still go parallel.
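
A minimal sketch of that idea, with hypothetical hostnames: create a PSOCK cluster that spans several machines and then use clusterEvalQ(), clusterExport() and parLapply() exactly as above.

# PSOCK cluster spanning several machines (hostnames are placeholders)
library('parallel')
hosts <- c("localhost", "192.168.0.2", "192.168.0.3")
cl <- makePSOCKcluster(hosts)
# ... clusterEvalQ(), clusterExport() and parLapply() as shown above ...
stopCluster(cl)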



More information in the Efficient R Programming book: https://csgillespie.github.io/efficientR/


