Monday, 5 August 2019

r - how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?


I have a SpatialPointsDataFrame with some additional data. I would like to extract those points inside a polygon and at the same time, preserve SPDF object and its corresponding data.


So far I've had little luck and resorted to matching and merging through a common ID, but this works only becaues I have gridded data with individual IDS.


Here's a quick example, I'm looking for points inside the red square.


library(sp)
set.seed(357)
pts <- data.frame(x = rnorm(100), y = rnorm(100), var1 = runif(100), var2 = sample(letters, 100, replace = TRUE))

coordinates(pts) <- ~ x + y
class(pts)
plot(pts)
axis(1); axis(2)

ply <- matrix(c(-1,-1, 1,-1, 1,1, -1,1, -1,-1), ncol = 2, byrow = TRUE)
ply <- SpatialPolygons(list(Polygons(list(Polygon(ply)), ID = 1)))
ply <- SpatialPolygonsDataFrame(Sr = ply, data = data.frame(polyvar = 357))
plot(ply, add = TRUE, border = "red")


The most obvious approach would be to use over, but this returns the data from the polygon.


> over(pts, ply)
polyvar
1 NA
2 357
3 357
4 NA
5 357
6 357

Answer




From the sp::over help:


 x = "SpatialPoints", y = "SpatialPolygons" returns a numeric
vector of length equal to the number of points; the number is
the index (number) of the polygon of ‘y’ in which a point
falls; NA denotes the point does not fall in a polygon; if a
point falls in multiple polygons, the last polygon is
recorded.

So if you convert your SpatialPolygonsDataFrame to SpatialPolygons you get back a vector of indexes and you can subset your points on NA:


> over(pts,as(ply,"SpatialPolygons"))

[1] NA 1 1 NA 1 1 NA NA 1 1 1 NA NA 1 1 1 1 1 NA NA NA 1 NA 1 NA
[26] 1 1 1 NA NA NA NA NA 1 1 NA NA NA 1 1 1 NA 1 1 1 NA NA NA 1 1
[51] 1 NA NA NA 1 NA 1 NA 1 NA NA 1 NA 1 1 NA 1 1 NA 1 NA 1 1 1 1
[76] 1 1 1 1 1 NA NA NA 1 NA 1 NA NA NA NA 1 1 NA 1 NA NA 1 1 1 NA

> nrow(pts)
[1] 100
> pts = pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),]
> nrow(pts)
[1] 54

> head(pts@data)
var1 var2
2 0.04001092 v
3 0.58108350 v
5 0.85682609 q
6 0.13683264 y
9 0.13968804 m
10 0.97144627 o
>


For the doubters, here's evidence that the conversion overhead is not a problem:


Two functions - first Jeffrey Evans' method, then my original, then my hacked conversion, then a version based on gIntersects based on Josh O'Brien's answer:


evans <- function(pts,ply){
prid <- over(pts,ply)
ptid <- na.omit(prid)
pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]
return(pt.poly)
}

rowlings <- function(pts,ply){

return(pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),])
}

rowlings2 <- function(pts,ply){
class(ply) <- "SpatialPolygons"
return(pts[!is.na(over(pts,ply)),])
}

obrien <- function(pts,ply){
pts[apply(gIntersects(columbus,pts,byid=TRUE),1,sum)==1,]

}

Now for a real-world example, I've scattered some random points over the columbus data set:


require(spdep)
example(columbus)
pts=data.frame(
x=runif(100,5,12),
y=runif(100,10,15),
z=sample(letters,100,TRUE))
coordinates(pts)=~x+y


Looks good


plot(columbus)
points(pts)

Check the functions are doing the same thing:


> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] TRUE

And run 500 times for benchmarking:



> system.time({for(i in 1:500){evans(pts,columbus)}})
user system elapsed
7.661 0.600 8.474
> system.time({for(i in 1:500){rowlings(pts,columbus)}})
user system elapsed
6.528 0.284 6.933
> system.time({for(i in 1:500){rowlings2(pts,columbus)}})
user system elapsed
5.952 0.600 7.222
> system.time({for(i in 1:500){obrien(pts,columbus)}})

user system elapsed
4.752 0.004 4.781

As per my intuition, its not a great overhead, in fact it might be less of an overhead than converting all the row indexes to character and back, or running na.omit to get missing values. Which incidentally leads to another failure mode of the evans function...


If a row of the polygons data frame is all NA (which is perfectly valid), then the overlay with SpatialPolygonsDataFrame for points in that polygon will produce an output data frame with all NAs, which evans() will then drop:


> columbus@data[1,]=rep(NA,20)
> columbus@data[5,]=rep(NA,20)
> columbus@data[17,]=rep(NA,20)
> columbus@data[15,]=rep(NA,20)
> set.seed(123)

> pts=data.frame(x=runif(100,5,12),y=runif(100,10,15),z=sample(letters,100,TRUE))
> coordinates(pts)=~x+y
> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] FALSE
> dim(evans(pts,columbus))
[1] 27 1
> dim(rowlings(pts,columbus))
[1] 28 1
>


BUT gIntersects is faster, even with having to sweep the matrix to check intersections in R rather than in C code. I suspect its the prepared geometry skills of GEOS, creating spatial indexes - yeah, with prepared=FALSE it takes a bit longer, about 5.5 seconds.


I'm surprised there isn't a function to either straight return the indices or the points. When I wrote splancs 20 years ago the point-in-polygon functions had both...


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...