r - how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?

Monday, 5 August 2019

r - how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?

I have a SpatialPointsDataFrame with some additional data. I would like to extract those points inside a polygon and at the same time, preserve SPDF object and its corresponding data.

So far I've had little luck and resorted to matching and merging through a common ID, but this works only becaues I have gridded data with individual IDS.

Here's a quick example, I'm looking for points inside the red square.

library(sp)
set.seed(357)
pts <- data.frame(x = rnorm(100), y = rnorm(100), var1 = runif(100), var2 = sample(letters, 100, replace = TRUE))

coordinates(pts) <- ~ x + y
class(pts)
plot(pts)
axis(1); axis(2)

ply <- matrix(c(-1,-1, 1,-1, 1,1, -1,1, -1,-1), ncol = 2, byrow = TRUE)
ply <- SpatialPolygons(list(Polygons(list(Polygon(ply)), ID = 1)))
ply <- SpatialPolygonsDataFrame(Sr = ply, data = data.frame(polyvar = 357))
plot(ply, add = TRUE, border = "red")

The most obvious approach would be to use over, but this returns the data from the polygon.

> over(pts, ply)
    polyvar
1        NA
2       357
3       357
4        NA
5       357
6       357

Answer

From the sp::over help:

 x = "SpatialPoints", y = "SpatialPolygons" returns a numeric
      vector of length equal to the number of points; the number is
      the index (number) of the polygon of ‘y’ in which a point
      falls; NA denotes the point does not fall in a polygon; if a
      point falls in multiple polygons, the last polygon is
      recorded.

So if you convert your SpatialPolygonsDataFrame to SpatialPolygons you get back a vector of indexes and you can subset your points on NA:

> over(pts,as(ply,"SpatialPolygons"))

  [1] NA  1  1 NA  1  1 NA NA  1  1  1 NA NA  1  1  1  1  1 NA NA NA  1 NA  1 NA
 [26]  1  1  1 NA NA NA NA NA  1  1 NA NA NA  1  1  1 NA  1  1  1 NA NA NA  1  1
 [51]  1 NA NA NA  1 NA  1 NA  1 NA NA  1 NA  1  1 NA  1  1 NA  1 NA  1  1  1  1
 [76]  1  1  1  1  1 NA NA NA  1 NA  1 NA NA NA NA  1  1 NA  1 NA NA  1  1  1 NA

> nrow(pts)
[1] 100
> pts = pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),]
> nrow(pts)
[1] 54

> head(pts@data)
         var1 var2
2  0.04001092    v
3  0.58108350    v
5  0.85682609    q
6  0.13683264    y
9  0.13968804    m
10 0.97144627    o
>

For the doubters, here's evidence that the conversion overhead is not a problem:

Two functions - first Jeffrey Evans' method, then my original, then my hacked conversion, then a version based on gIntersects based on Josh O'Brien's answer:

evans <- function(pts,ply){
  prid <- over(pts,ply)
  ptid <- na.omit(prid) 
  pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]
  return(pt.poly)
}

rowlings <- function(pts,ply){

  return(pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),])
}

rowlings2 <- function(pts,ply){
  class(ply) <- "SpatialPolygons"
  return(pts[!is.na(over(pts,ply)),])
}

obrien <- function(pts,ply){
pts[apply(gIntersects(columbus,pts,byid=TRUE),1,sum)==1,]

}

Now for a real-world example, I've scattered some random points over the columbus data set:

require(spdep)
example(columbus)
pts=data.frame(
    x=runif(100,5,12),
    y=runif(100,10,15),
    z=sample(letters,100,TRUE))
coordinates(pts)=~x+y

Looks good

plot(columbus)
points(pts)

Check the functions are doing the same thing:

> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] TRUE

And run 500 times for benchmarking:

> system.time({for(i in 1:500){evans(pts,columbus)}})
   user  system elapsed 
  7.661   0.600   8.474 
> system.time({for(i in 1:500){rowlings(pts,columbus)}})
   user  system elapsed 
  6.528   0.284   6.933 
> system.time({for(i in 1:500){rowlings2(pts,columbus)}})
   user  system elapsed 
  5.952   0.600   7.222 
> system.time({for(i in 1:500){obrien(pts,columbus)}})

  user  system elapsed 
  4.752   0.004   4.781

As per my intuition, its not a great overhead, in fact it might be less of an overhead than converting all the row indexes to character and back, or running na.omit to get missing values. Which incidentally leads to another failure mode of the evans function...

If a row of the polygons data frame is all NA (which is perfectly valid), then the overlay with SpatialPolygonsDataFrame for points in that polygon will produce an output data frame with all NAs, which evans() will then drop:

> columbus@data[1,]=rep(NA,20)
> columbus@data[5,]=rep(NA,20)
> columbus@data[17,]=rep(NA,20)
> columbus@data[15,]=rep(NA,20)
> set.seed(123)

> pts=data.frame(x=runif(100,5,12),y=runif(100,10,15),z=sample(letters,100,TRUE))
> coordinates(pts)=~x+y
> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] FALSE
> dim(evans(pts,columbus))
[1] 27  1
> dim(rowlings(pts,columbus))
[1] 28  1
>

BUT gIntersects is faster, even with having to sweep the matrix to check intersections in R rather than in C code. I suspect its the prepared geometry skills of GEOS, creating spatial indexes - yeah, with prepared=FALSE it takes a bit longer, about 5.5 seconds.

I'm surprised there isn't a function to either straight return the indices or the points. When I wrote splancs 20 years ago the point-in-polygon functions had both...

Blog

Monday, 5 August 2019

r - how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?