Thursday, 17 March 2016

postgresql - Get a single cluster from cloud of points with specified maximum diameter in postgis


I have on the order of 2M Geometry Collections with N (median: 2, stddev: 6.6, 75th: 6, 95th: 21) Point objects (repeated points allowed and important) in a PostGIS table. I want to identify clusters within each collection, with a specified maximum diameter. From these I would identify a main cluster (which I assume exists), based on the number of points it contains (and in case of a tie, choose the cluster with the smallest area).


Since I will be using these clusters further in the database, I'd rather use a technique within PostgreSQL/PostGIS.


K-means clustering is not ideal because it requires specifying a priori the number of clusters, which is impossible due to the heterogeneity of the data. And specifying 1 cluster would return the entire collection.



Answer



I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.


CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))


and a type with cluster id


CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)

Next the algorithm function


CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)

RETURNS SETOF clustered_pt AS
$BODY$

DECLARE
srid int;
joined_clusters int[];

BEGIN

--If there's only 1 point, don't bother with the loop.

IF array_length(points,1)<2 THEN
RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
RETURN;
END IF;

CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;

BEGIN
ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION

WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;

TRUNCATE points2;
--inserting points in
INSERT INTO points2(gid, the_geom)
(SELECT (unnest(points)).* );

--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;


UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom = ST_TRANSFORM(the_geom,26986);

LOOP
--If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)

> 2 * radius
THEN
EXIT;
END IF;

joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))

LIMIT 1;

UPDATE points2
SET cluster_id = joined_clusters[1]
WHERE cluster_id = joined_clusters[2];

--If there's only 1 cluster left, exit loop
IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
EXIT;


END IF;

END LOOP;

RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql

Because of the function syntax in psql, implementation goes as



WITH subq AS(
SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
FROM data
GROUP BY collection_id)
SELECT (clusters).* FROM
(SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;

If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...