I am looking for a way to spatially clusters thousands of datapoints (potentially millions) based on distance, such that each cluster contains less than 5000 points.
This is a similar question to Problems with ST_ClusterDBSCAN cluster sizes . I would like to build upon the provided answer by using WITH RECURSIVE to automatically continue splitting clusters until they are all bellow a size.
This is the query I came up with (not complete):
WITH RECURSIVE clusterize(cid, csize, autopoi_ids, eps) AS (
SELECT cid, csize, unnest(poi_ids) as poi_id, eps
FROM (
SELECT cid, count(*) as csize, array_agg(id) as poi_ids, 0.05 as eps
FROM (
SELECT id, ST_ClusterDBSCAN(geometry, eps := 0.05, minpoints := 3) over () AS cid
FROM stats_autopoistat
) clusters
GROUP BY cid
) q
UNION ALL
SELECT cid, csize, unnest(poi_ids) as poi_id, eps
FROM (
SELECT cid, count(*) as csize, array_agg(id) as poi_ids, ( SELECT eps/2.0 FROM clusterize LIMIT 1 )/2.0 as eps
FROM (
SELECT id, (SELECT max(cid) FROM clusterize) + ST_ClusterDBSCAN(geometry, eps := ( SELECT eps/2.0 FROM clusterize LIMIT 1), minpoints := 0) over () AS cid
FROM clusterize
WHERE csize > 5000
) clusters
GROUP BY cid
) q
)
SELECT *
-- here filter out non-max cids for each poi_id
FROM clusterize limit 1000
However, it seems I am unable to refer to the recursive CTE inside a subquery, as Postgres complains with:
ERROR: recursive reference to query "clusterize" must not appear within a subquery
LINE 15: ..., array_agg(id) as poi_ids, ( SELECT eps/2.0 FROM clusterize...
I would like to know if this can even be come with WITH RECURSIVE given the limitations I encountered above.
The reason I want to accomplish this within Postgres and not Python is that the number of points to cluster will continue increasing. The table already has about 1 million rows, and I would like to avoid loading all this data into Python.
Answer
No time for more improving or testing, but: for a single, more generic recursive term, and possibly better performance, try
WITH RECURSIVE
params AS ( -- convenience variables for testing parameters
SELECT 10 AS max_size, -- max. cluster size
1 AS max_points, -- 'max_points' parameter
1 AS eps, -- 'eps' distance parameter
0.1 AS fraction -- decreasing fraction of/to 'eps' parameter
),
clst AS (
SELECT ARRAY[a._clst_id] AS _clst_ids,
1 - (1 * (SELECT fraction FROM params)) AS _eps,
ST_Collect(a.geom) AS geom
FROM (
SELECT id,
ST_SetSRID(ST_MakePointM(ST_X(geom), ST_Y(geom), id), 4326) AS geom,
ST_ClusterDBSCAN(geom, (SELECT eps FROM params), (SELECT max_points FROM params)) OVER() AS _clst_id
FROM
) AS a
GROUP BY
_clst_id
UNION ALL
SELECT CASE WHEN ST_NumGeometries(b.geom) > (SELECT max_size FROM params)
THEN a._clst_ids || b._clst_id
ELSE NULL
END AS _clst_ids,
a._eps - (a._eps * (SELECT fraction FROM params)) AS _eps,
b.geom AS geom
FROM clst AS a
CROSS JOIN LATERAL (
SELECT ST_Collect(c.geom) AS geom,
c._clst_id
FROM (
SELECT dmp.geom,
ST_ClusterDBSCAN(dmp.geom, a._eps, (SELECT max_points FROM params)) OVER() AS _clst_id
FROM LATERAL ST_DumpPoints(a.geom) AS dmp
) c
GROUP BY
c._clst_id
) b
WHERE ST_NumGeometries(a.geom) > (SELECT max_size FROM params)
)
SELECT ST_M(geom)::INT AS id,
ST_Force2d(geom) AS geom,
FROM (
SELECT ROW_NUMBER() OVER() AS clst_id,
(ST_DumpPoints(geom)).geom
FROM clst
WHERE _clst_ids IS NULL
) q
;
This approach ST_Collect
s points based on their _clst_id
and recursively processes those (each row in clst
) with ST_NumGeometries > max_size
using a LATERAL JOIN
. If a cluster has reached max_size
, it get's NULL
as _clst_ids
to mark it as a finished cluster.
I used params.fraction = 0.1
to decrease the eps
distance, which is pretty intense; smaller values will yield more precise results, but increase the execution time (probably) exponentially.
Since geometry aggregation makes it a pain in the to keep attributes along the way, and a join on geometric equality with very large tables to retrieve the original attributes is costly, I write the id
of each point into the M coordinate of the points and extract them later. This only works with numeric values.
If you are interested in MultiPoint geometries per cluster, just remove those parts and the dump in the final query.
It would probably be a better idea to write a function for this; I couldn't say if a DO ... WHILE
loop would perform better than the WITH RECURSIVE
implementation, but you could work with attributes a lot better (and probably more performant, especially if you are interested in other original attributes than the id
).
No comments:
Post a Comment