I have on the order of 2M Geometry Collections with N (median: 2, stddev: 6.6, 75th: 6, 95th: 21) Point objects (repeated points allowed and important) in a PostGIS table. I want to identify clusters within each collection, with a specified maximum diameter. From these I would identify a main cluster (which I assume exists), based on the number of points it contains (and in case of a tie, choose the cluster with the smallest area).
Since I will be using these clusters further in the database, I'd rather use a technique within PostgreSQL/PostGIS.
K-means clustering is not ideal because it requires specifying a priori the number of clusters, which is impossible due to the heterogeneity of the data. And specifying 1 cluster would return the entire collection.
Answer
I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.
CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))
and a type with cluster id
CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)
Next the algorithm function
CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$
DECLARE
srid int;
joined_clusters int[];
BEGIN
--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
RETURN;
END IF;
CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;
BEGIN
ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;
TRUNCATE points2;
--inserting points in
INSERT INTO points2(gid, the_geom)
(SELECT (unnest(points)).* );
--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;
UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom = ST_TRANSFORM(the_geom,26986);
LOOP
--If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
> 2 * radius
THEN
EXIT;
END IF;
joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
FROM points2 a, points2 b
WHERE a.cluster_id <> b.cluster_id
GROUP BY a.cluster_id, b.cluster_id
ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))
LIMIT 1;
UPDATE points2
SET cluster_id = joined_clusters[1]
WHERE cluster_id = joined_clusters[2];
--If there's only 1 cluster left, exit loop
IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
EXIT;
END IF;
END LOOP;
RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql
Because of the function syntax in psql, implementation goes as
WITH subq AS(
SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
FROM data
GROUP BY collection_id)
SELECT (clusters).* FROM
(SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;
If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!
No comments:
Post a Comment