Tuesday 20 December 2016

R - multicore approach to extract raster values using spatial points


This question might look like a duplicate, but I am asking about data extraction by points, rather than by polygons, and I could not find any hints about point extraction. Therefore, please bear with me.



I am working with a massive amount of climate files whose values need to be extracted and converted to data frames in order to serve as a database for a Shiny application.


Specifically, I need to extract daily temperature values from a raster stack for 655 locations (points) and store those values in a data frame. Given the amount of data, I estimate the final data frame to be around 30 GB in size.


Since this is such a large amount of data, I am looking for a multicore (parallel) approach that takes advantage of a quad-core (8-thread) system. I am also considering processing these data on an Amazon EC2 instance, so I need to speed up the code as much as possible to avoid wasting time (i.e. money).


I know that the raster package provides the multicore function beginCluster that might help me, but according to the documentation it only works when extracting data using polygons, not points.


Please take a look at the code that reproduces my procedure using a simplified version of my raster stack:


library(raster)

# Create date sequence
idx <- seq(as.Date("2010/1/1"), as.Date("2099/12/31"), by = "day")


# Create raster stack and assign dates
# WARNING: raster stack will be ~ 400 MB in size
r <- raster(ncol=5, nrow=5)
s <- stack(lapply(1:length(idx), function(x) setValues(r, runif(ncell(r)))))
s <- setZ(s, idx)

# Create random spatial points
pts <- SpatialPoints(cbind(x=runif(655, -180, 180),
                           y=runif(655, -90, 90)),
                     proj4string=CRS(projection(s)))


# Extract values to a data frame
e.df <- extract(s, pts, df=TRUE)

# Fix resulting data frame
DF <- data.frame(t(e.df[-1])); row.names(DF) <- NULL
DF <- cbind(getZ(s), DF)
names(DF)[1] <- "Date"
names(DF)[2:ncol(DF)] <- paste0('Point ', 1:length(pts))


In the code above, clearly the slowest part of the process is the data extraction itself.
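As a rough sanity check that extraction dominates the runtime, the two stages can be timed separately with `system.time`. The sketch below recreates the objects at a deliberately small size (the layer and point counts are placeholders, not my real dimensions):

```r
# Timing sketch with small stand-ins for the real stack and points
library(raster)

r <- raster(ncol = 5, nrow = 5)
s <- stack(lapply(1:50, function(x) setValues(r, runif(ncell(r)))))
pts <- SpatialPoints(cbind(x = runif(10, -180, 180),
                           y = runif(10, -90, 90)),
                     proj4string = CRS(projection(s)))

# Time the extraction and the reshaping separately
t.extract <- system.time(e.df <- extract(s, pts, df = TRUE))["elapsed"]
t.reshape <- system.time(DF <- data.frame(t(e.df[-1])))["elapsed"]
c(extract = t.extract, reshape = t.reshape)
```

At the real problem size the gap between the two timings is far larger than at this toy size.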


Is there any way to improve the speed of this data extraction procedure?


Any way to implement it under a multicore function?



Answer



I ended up using an approach based on the snowfall package. It is quite simple, works really well, and the point extraction scales with the number of cores you can use. The approach was inspired by this post, and here is my reproducible example:


library(raster)
library(snowfall)

# Create date sequence
idx <- seq(as.Date("2010/1/1"), as.Date("2099/12/31"), by = "day")


# Create raster stack and assign dates
# WARNING: raster stack will be ~ 400 MB in size
r <- raster(ncol=5, nrow=5)
s <- stack(lapply(1:length(idx), function(x) setValues(r, runif(ncell(r)))))
s <- setZ(s, idx)

# Create random spatial points
pts <- SpatialPoints(cbind(x=runif(655, -180, 180),
                           y=runif(655, -90, 90)),
                     proj4string=CRS(projection(s)))

# Extract values to a data frame - multicore approach
# First, convert raster stack to list of single raster layers
s.list <- unstack(s)
names(s.list) <- names(s)

# Now, create a R cluster using all the machine cores minus one
sfInit(parallel=TRUE, cpus=parallel::detectCores()-1)


# Load the required packages inside the cluster
sfLibrary(raster)
sfLibrary(sp)

# Run parallelized 'extract' function and stop cluster
e.df <- sfSapply(s.list, extract, y=pts)
sfStop()

# Fix resulting data frame
DF <- data.frame(t(e.df)); row.names(DF) <- NULL
DF <- cbind(getZ(s), DF)
names(DF)[1] <- "Date"
names(DF)[2:ncol(DF)] <- paste0('Point ', 1:length(pts))

# Check resulting data frame
head(DF)
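For comparison, the same per-layer split can be written with base R's parallel package alone, without snowfall. This is only a sketch at toy size, assuming the same kind of stack and points as in the example above:

```r
# Sketch: per-layer parallel extraction using base R's 'parallel' package
library(raster)
library(parallel)

r <- raster(ncol = 5, nrow = 5)
s <- stack(lapply(1:10, function(x) setValues(r, runif(ncell(r)))))
pts <- SpatialPoints(cbind(x = runif(20, -180, 180),
                           y = runif(20, -90, 90)),
                     proj4string = CRS(projection(s)))

# Split the stack into single layers, as in the snowfall version
s.list <- unstack(s)

# One worker per core minus one; load 'raster' on each worker
cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(raster))
e.par <- parSapply(cl, s.list, raster::extract, y = pts)
stopCluster(cl)

# Should match the single-core result
e.ser <- extract(s, pts)
all.equal(unname(e.par), unname(e.ser))
```

The design is the same in both versions: extraction over points is embarrassingly parallel across layers, so each worker handles a subset of the layers and the results are bound back together by column.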

I hope it can serve as a future reference for people interested in doing the same.

