Tuesday, 14 February 2017

Increasing speed of crop, mask, & extract raster by many polygons in R?


I'm extracting the area and percent cover of different land use types from a raster based on several thousand polygon boundaries. I've found that the extract function works much faster if I iterate through each individual polygon, cropping and then masking the raster down to the size of that particular polygon. Nonetheless, it's still pretty slow, and I'm wondering if anyone has suggestions for improving the efficiency and speed of my code.


The only thing I've found related to this is this response by Roger Bivand, who suggested using GDAL.open() and GDAL.close() as well as getRasterTable() and getRasterData() from rgdal. I looked into those, but I've had trouble with GDAL in the past and don't know it well enough to implement them.
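For reference, the rgdal version of that suggestion might look roughly like this (a sketch only; the file path is hypothetical and I haven't benchmarked it):

library(rgdal)

ds <- GDAL.open("c:/data/landuse.tif") #hypothetical path to a raster file
vals <- getRasterData(ds) #reads the band into memory as a matrix
GDAL.close(ds)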


Reproducible Example:


library(maptools)  ## For wrld_simpl
library(raster)

## Example SpatialPolygonsDataFrame
data(wrld_simpl) #polygon of world countries

bound <- wrld_simpl[1:25,] #subset to 25 countries; the loop below iterates over 'bound'

## Example RasterLayer (note: 'c' shadows base::c(); kept to match the code below)
c <- raster(nrow=2e3, ncol=2e3, crs=proj4string(wrld_simpl), xmn=-180, xmx=180, ymn=-90, ymx=90)
c[] <- 1:ncell(c) #fill with unique values

#plot, so you can see it
plot(c)
plot(bound, add=TRUE)


Fastest Method so far


result <- data.frame() #empty result dataframe

system.time(
  for (i in 1:nrow(bound)) { #iterate through the polygons
    single <- bound[i,]               #select a single polygon
    clip1 <- crop(c, extent(single))  #crop the raster to the polygon's extent; doing this first speeds the mask up
    clip2 <- mask(clip1, single)      #mask the raster to the polygon boundary

    ext <- extract(clip2, single)     #extract raster values within the polygon

    tab <- lapply(ext, table)         #tabulate the extract output
    s <- sum(tab[[1]])                #total cell count, for the percentage calculation
    mat <- as.data.frame(tab)
    mat2 <- as.data.frame(tab[[1]]/s) #calculate percent cover
    final <- cbind(single@data$NAME, mat, mat2$Freq) #combine into a single dataframe
    result <- rbind(final, result)
  }
)

   user  system elapsed 
  39.39    0.11   39.52


Parallel Processing


Parallel processing cut the user time by half, but negated the benefit by doubling the system time. raster uses the cluster for its extract() function, but unfortunately not for crop() or mask(). The net result is a slightly longer total elapsed time, because the workers spend part of it waiting on I/O.


library(parallel) #for detectCores()
beginCluster(detectCores() - 1) #use all but one core

Then run the same loop as above; extract() picks up the cluster automatically:

   user  system elapsed 
  23.31    0.68   42.01

Then end the cluster:

endCluster()

Slow Method: The alternative of calling extract() directly on the full raster takes far longer, and I'm not sure about the data management needed to get the output into the form I want:


system.time(ext <- extract(c, bound))
   user  system elapsed 
1170.64   14.41 1186.14
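For what it's worth, here is a sketch of one way to wrangle that extract() list output into the same per-polygon table as above (the column names here are my own):

tabs <- lapply(ext, table) #one table of cell values per polygon
result2 <- do.call(rbind, lapply(seq_along(tabs), function(i) {
  tb <- tabs[[i]]
  data.frame(NAME = bound$NAME[i], value = names(tb),
             count = as.vector(tb), pct = as.vector(tb)/sum(tb))
}))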

Answer



I have finally gotten around to improving this function. I found that, for my purposes, it was fastest to rasterize() the polygon first and use getValues() instead of extract(). Rasterizing isn't much faster than the original code for tabulating raster values in small polygons, but it shines for large polygons, where a large raster has to be cropped and its values extracted. I also found that getValues() was much faster than extract().
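The core pattern, shown for a single polygon from the reproducible example above (a minimal sketch; the full parallel function follows below):

single <- bound[1,] #one polygon
clip1 <- crop(c, extent(single)) #crop to the polygon's extent
clip2 <- rasterize(single, clip1, mask=TRUE) #rasterize the polygon as a mask
tab <- table(getValues(clip2)) #tabulate cell values; NA cells outside the polygon are dropped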


I also figured out the multi-core processing using foreach().


I hope this is useful for other people who want an R solution for extracting raster values with a polygon overlay. It is similar to ArcGIS's "Tabulate Intersection", which did not work well for me, returning empty outputs after hours of processing, as this user also found.



#initiate multicore cluster and load packages
library(foreach)
library(doParallel)
library(tcltk)
library(sp)
library(raster)

cores <- 7
cl <- makeCluster(cores, outfile="") #outfile="" sends worker output and errors to the console
registerDoParallel(cl)


Here's the function:


multicore.tabulate.intersect <- function(cores, polygonlist, rasterlayer){ 
  foreach(i = 1:cores, .packages = c("raster","tcltk","foreach"), .combine = rbind) %dopar% {

    mypb <- tkProgressBar(title = "R progress bar", label = "", min = 0, max = length(polygonlist[[i]]), initial = 0, width = 300)

    foreach(j = 1:length(polygonlist[[i]]), .combine = rbind) %do% {
      final <- data.frame()
      tryCatch({ #probably not necessary with foreach, but useful so one bad polygon doesn't kill the loop

        single <- polygonlist[[i]][j,] #pull out the individual polygon to be tabulated

        dir.create(file.path("c:/rtemp", i, j, single@data$OWNER), showWarnings = FALSE, recursive = TRUE) #create a unique temp directory
        rasterOptions(tmpdir = file.path("c:/rtemp", i, j, single@data$OWNER)) #set the temp directory - important, because raster temp files can fill a hard drive when processing many polygons

        clip1 <- crop(rasterlayer, extent(single)) #crop to the extent of the polygon
        clip2 <- rasterize(single, clip1, mask = TRUE) #clip to the polygon edge & convert to raster
        ext <- getValues(clip2) #much faster than extract()
        tab <- table(ext) #tabulate the raster values within the polygon

        mat <- as.data.frame(tab)
        final <- cbind(single@data$OWNER, mat) #combine with the name of the polygon
        unlink(file.path("c:/rtemp", i, j, single@data$OWNER), recursive = TRUE, force = TRUE) #delete the temporary files
        setTkProgressBar(mypb, j, title = "number complete", label = as.character(j))

      }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")}) #catch errors so they don't kill the loop

      final #the inner foreach returns 'final' as its value (return() doesn't work inside foreach)
    }

    #close(mypb) #not sure why, but closing the progress bar while operating causes it to return an empty final dataset
  }
}

So to use it, change single@data$OWNER to match the column name that identifies your polygons (I suppose that could have been built into the function...), split your polygons into a list with one chunk per core, and put in:


myoutput <- multicore.tabulate.intersect(cores, polygonlist, rasterlayer)
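For example, with bound and c from the reproducible example above (a sketch; the round-robin split and the OWNER column copy are my own additions):

bound$OWNER <- bound$NAME #the function expects an OWNER column; copy the identifier over

polygonlist <- lapply(1:cores, function(k) bound[seq(k, nrow(bound), by = cores), ]) #one chunk per core

myoutput <- multicore.tabulate.intersect(cores, polygonlist, c)

stopCluster(cl) #shut the cluster down when finished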
