Monday, 29 February 2016

r - Cluster geometries that touch each other


I'm trying to cluster the geometries in a sf object that touch each other. For the moment I'm using the same approach that it's explained in the answer here, but the problem is that the same idea doesn't work if I consider a bigger number of geometries because the touching matrix becomes too memory-heavy and I'm not using its sparse properties.


A small and (I hope) reproducible example


# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright

# download data
milan_primary_highways <- opq("Milan, Italy") %>%
add_osm_feature(key = "highway", value = "primary") %>%
osmdata_sf() %>%
trim_osmdata(bb_poly = getbb("Milan, Italy", format_out = "polygon")[[1]]) %>%
magrittr::use_series(osm_lines)


# Determine the non-sparse matrix of touching geometries
touching_matrix <- st_touches(milan_primary_highways, sparse = FALSE)
#> although coordinates are longitude/latitude, st_touches assumes that they are planar

# Cluster the geometries
hc <- hclust(as.dist(!touching_matrix), method="single")

# Cut the dendrogram
groups <- cutree(hc, h=0.5)


# Results
table(groups)
#> groups
#> 1 2 3 4 5 6 7 8 9 10 11 12
#> 1444 21 18 8 4 5 8 2 3 5 2 2
plot(st_geometry(milan_primary_highways), col = groups)


Is there an alternative approach that takes advantage of the sparse property of the touching matrix (that I cannot use because as.dist just accepts a numeric matrix). Do you want me to provide an example when this approach does not work? The error is simply "R cannot allocate a vector of size n GB" when I use the function hclust(as.dist(x)).


I don't know if it's important but I tried to explain the reason why I need this clustering here




Answer



Make an adjacency list:


touching_list = st_touches(milan_primary_highways)

this isn't NxN so won't grow - its a list of length N where each element is only the adjacent elements. This might help it scale up to larger problems.


Now use igraph to create a graph objects and get the connectivity:


library(igraph)
g = graph.adjlist(touching_list)
c = components(g)


The membership element of c is of length N and says which cluster it is in. It summarises the same as your example:


> table(c$membership)

1 2 3 4 5 6 7 8 9 10 11 12
1444 21 18 8 4 5 8 2 3 5 2 2

So if you want to assign it to the data, do:


 milan_primary_highways$groups = c$membership

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...