Tuesday 23 February 2016

Efficient spatial joining for large datasets in R


I am working with rather large data frames that I often need to do a spatial join on. The fastest way I have come up with so far is this method:



library(rgdal) # also attaches sp, which provides SpatialPoints() and over()

download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# Read ecoregion shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land", verbose = FALSE)

# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
c <- expand.grid(lon = lon, lat = lat)

# Get the ecoregion for each position
pings <- SpatialPoints(c[c('lon', 'lat')], proj4string = ices_eco@proj4string)
c$area <- over(pings, ices_eco)$Ecoregion

But this takes a very long time and uses a lot of RAM, and sometimes fails with "Error: cannot allocate vector of size 460 Kb" (if you can't reproduce the error, just make c larger).


Can anyone come up with a better/faster/more efficient solution?



Answer




First of all, making a function more efficient can mean speeding up how quickly the computer undertakes that action (algorithmic efficiency). And...



Efficient R programming is the implementation of efficient programming practices in R. All languages are different, so efficient R code does not look like efficient code in another language. Many packages have been optimised for performance so, for some operations, achieving maximum computational efficiency may simply be a case of selecting the appropriate package and using it correctly. There are many ways to get the same result in R, and some are very slow. Therefore not writing slow code should be prioritised over writing fast code. (Taken from the book Efficient R Programming.)



So, performing more efficient spatial joins for large datasets could simply mean writing faster code. In this case, I assume that the over function from the sp package has already been optimised for performance, so I won't write another function myself or look for another R package.


Instead, I will show how to go parallel in R and speed up the process by using all the CPU cores in your computer. Please try the commented, reproducible code example below:


Load data


# Load libraries
library(rgdal)
library(sp) # attached by rgdal anyway, but loaded explicitly for clarity

# Load data
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")

# Read ecoregion shapefiles
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land",
                           verbose = FALSE)

# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
df <- expand.grid(lon = lon, lat = lat)
df$area <- NA # Add an empty attribute "area" to hold the ecoregions later

# Create a SpatialPointsDataFrame object from the data.frame
coordinates(df) <- c("lon", "lat")

# Assign the same projection as the ecoregions to the SpatialPointsDataFrame
proj4string(df) <- ices_eco@proj4string
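
As an optional sanity check, you can verify that both objects now share the same CRS, since over() expects matching coordinate reference systems:

# Optional sanity check: over() expects both objects in the same CRS
stopifnot(identical(proj4string(df), proj4string(ices_eco)))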


Aim: get the Ecoregion from the ices_eco SpatialPolygonsDataFrame object for each position in the df SpatialPointsDataFrame object.


# Parallel processing: using multiple CPU cores

# Load the 'parallel' package, which supports parallel computation in R
library('parallel')

# Calculate the number of cores (leave one core free for your operating system)
no_cores <- detectCores() - 1

# Cut df into "n" parts
# Higher "n": less memory required but slower
# Lower "n": more memory required but faster
n <- 1000
parts <- split(x = 1:length(df), f = cut(1:length(df), n))
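
As a quick illustrative check, you can confirm that the n parts jointly cover every row of df exactly once and in order:

# Illustrative check: the parts cover 1:length(df) exactly once, in order
length(parts)      # number of chunks (n)
length(parts[[1]]) # roughly length(df) / n points per chunk
stopifnot(identical(unlist(parts, use.names = FALSE), 1:length(df)))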

Initiate the cluster like this if you are on macOS or Linux:




# Initiate a cluster of CPU cores
# Note: a FORK cluster copies the parent workspace, so all the objects used
# in the parallel process (e.g. ices_eco, df, n, parts) must be defined
# before the cluster is made
cl <- makeCluster(no_cores, type = "FORK")

print(cl) # summary of the cluster

Initiate the cluster like this if you are on Windows:




# Initiate a cluster of CPU cores
# Note: a PSOCK cluster starts with empty workers, so all the objects used
# in the parallel process (e.g. ices_eco, df, n, parts) must be defined
# before the cluster is made and then exported to the workers
cl <- makeCluster(no_cores, type = "PSOCK")

# Load the required libraries on the cluster workers
clusterEvalQ(cl = cl, expr = c(library('sp')))

# Export to the workers all the objects required to run the function
clusterExport(cl = cl, varlist = c("ices_eco", "df", "parts", "n"))

print(cl) # summary of the cluster


Continue running the parallel function


# Parallelize the sp::over function
# Returns a list with the overlays
system.time(
  overParts <- parLapply(cl = cl, X = 1:n, fun = function(x) {
    ov <- over(df[parts[[x]], ], ices_eco) # overlay one chunk of points
    gc() # release memory on the worker
    return(ov)
  })
)

# user system elapsed
# 1.050 1.150 627.111

# Stop the cluster of CPU cores
stopCluster(cl)

# Merge df with the ecoregions
for (i in 1:n) {
  message(paste("Merging part", i, "of", n))
  df$area[parts[[i]]] <- as.character(overParts[[i]]$Ecoregion)
}
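
Because the parts cover the rows of df in order (see the check above), the merge loop can also be replaced by a single vectorized assignment; a minimal sketch, assuming that ordering holds:

# Vectorized alternative to the merge loop (assumes parts are in row order)
df$area <- unlist(lapply(overParts, function(p) as.character(p$Ecoregion)),
                  use.names = FALSE)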

Control check


# Control check by randomly sampling 20 elements
randomSampling <- sample(x = 1:length(df), size = 20)

# Direct method (no parallelization)
chkA <- as.character(over(df[randomSampling, ], ices_eco,
                          returnList = FALSE)$Ecoregion)
# Values taken from the parallel result
chkB <- df$area[randomSampling]

# chkA should be equal to chkB
print(cbind(chkA, chkB))

#       chkA                         chkB
# [1,]  NA                           NA
# [2,]  "Baltic Sea"                 "Baltic Sea"
# [3,]  NA                           NA
# [4,]  "Baltic Sea"                 "Baltic Sea"
# [5,]  "Baltic Sea"                 "Baltic Sea"
# [6,]  NA                           NA
# [7,]  NA                           NA
# [8,]  "Celtic Seas"                "Celtic Seas"
# [9,]  NA                           NA
# [10,] NA                           NA
# [11,] "Baltic Sea"                 "Baltic Sea"
# [12,] "Oceanic Northeast Atlantic" "Oceanic Northeast Atlantic"
# [13,] "Greater North Sea"          "Greater North Sea"
# [14,] NA                           NA
# [15,] NA                           NA
# [16,] "Faroes"                     "Faroes"
# [17,] NA                           NA
# [18,] NA                           NA
# [19,] NA                           NA
# [20,] "Baltic Sea"                 "Baltic Sea"
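
Instead of eyeballing the table, a programmatic comparison along these lines also works (counting two NAs as a match):

# Programmatic check: every sampled pair must match (NA matched with NA)
stopifnot(all(chkA == chkB | (is.na(chkA) & is.na(chkB))))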

Note: if you have access to more than one computer, you can use a cluster of machines and still go parallel.
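
A minimal sketch of that idea, with hypothetical hostnames: create a PSOCK cluster that spans several machines and then use clusterEvalQ(), clusterExport() and parLapply() exactly as above.

# PSOCK cluster spanning several machines (hostnames are placeholders)
library('parallel')
hosts <- c("localhost", "192.168.0.2", "192.168.0.3")
cl <- makePSOCKcluster(hosts)
# ... clusterEvalQ(), clusterExport() and parLapply() as shown above ...
stopCluster(cl)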



More information in the Efficient R Programming book: https://csgillespie.github.io/efficientR/


