Sunday, 2 December 2018

Spatial clustering with PostGIS?


I'm looking for spatial clustering algorithm for using it within PostGIS-enabled database for point features. I'm going to write plpgsql function that takes distance between points within the same cluster as input. At the output function returns array of clusters. The most obvious solution is to build buffer zones specified distance around the feature and search for features into this buffer. If such features exist then continue to build a buffer around them, etc. If such features not exist that means cluster building is completed. Maybe there are some clever solutions?



Answer



There are at least two good clustering methods for PostGIS: k-means (via kmeans-postgresql extension) or clustering geometries within a threshold distance (PostGIS 2.2)






Installation: You need to compile and install this from source code, which is easier to do on *NIX than Windows (I don't know where to start). If you have PostgreSQL installed from packages, make sure you also have the development packages (e.g., postgresql-devel for CentOS).


Download, extract, build and install:


wget http://api.pgxn.org/dist/kmeans/1.1.0/kmeans-1.1.0.zip
unzip kmeans-1.1.0.zip
cd kmeans-1.1.0/
make USE_PGXS=1
sudo make install


Enable the extension in a database (using psql, pgAdmin, etc.):


CREATE EXTENSION kmeans;

Usage/Example: You should have a table of points somewhere (I drew a bunch of pseudo random points in QGIS). Here is an example with what I did:


SELECT kmeans, count(*), ST_Centroid(ST_Collect(geom)) AS geom
FROM (
SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom
FROM rand_point
) AS ksub
GROUP BY kmeans

ORDER BY kmeans;

the 5 I provided in the second argument of the kmeans window function is the K integer to produce five clusters. You can change this to whatever integer you want.


Below is the 31 pseudo random points I drew and the five centroids with the label showing the count in each cluster. This was created using the above SQL query.


Kmeans




You can also attempt to illustrate where these clusters are with ST_MinimumBoundingCircle:


SELECT kmeans, ST_MinimumBoundingCircle(ST_Collect(geom)) AS circle
FROM (
SELECT kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER (), geom

FROM rand_point
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Kmeans2





This aggregate function is included with PostGIS 2.2, and returns an array of GeometryCollections where all the components are within a distance of each other.


Here is an example use, where a distance of 100.0 is the threshold that results in 5 different clusters:



SELECT row_number() over () AS id,
ST_NumGeometries(gc),
gc AS geom_collection,
ST_Centroid(gc) AS centroid,
ST_MinimumBoundingCircle(gc) AS circle,
sqrt(ST_Area(ST_MinimumBoundingCircle(gc)) / pi()) AS radius
FROM (
SELECT unnest(ST_ClusterWithin(geom, 100)) gc
FROM rand_point
) f;


ClusterWithin100


The largest middle cluster has a enclosing circle radius of 65.3 units or about 130, which is larger than the threshold. This is because the individual distances between the member geometries is less than the threshold, so it ties it together as one larger cluster.


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...