postgis - Applying ST_ClusterDBSCAN repeatedly until cluster sizes small enough

Sunday, 8 December 2019

postgis - Applying ST_ClusterDBSCAN repeatedly until cluster sizes small enough

I am looking for a way to spatially clusters thousands of datapoints (potentially millions) based on distance, such that each cluster contains less than 5000 points.

This is a similar question to Problems with ST_ClusterDBSCAN cluster sizes . I would like to build upon the provided answer by using WITH RECURSIVE to automatically continue splitting clusters until they are all bellow a size.

This is the query I came up with (not complete):

WITH RECURSIVE clusterize(cid, csize, autopoi_ids, eps) AS (
    SELECT cid, csize, unnest(poi_ids) as poi_id, eps
    FROM (
        SELECT cid, count(*) as csize, array_agg(id) as poi_ids, 0.05 as eps
        FROM (

            SELECT id, ST_ClusterDBSCAN(geometry, eps := 0.05, minpoints := 3) over () AS cid
            FROM stats_autopoistat
        ) clusters
        GROUP BY cid
    ) q

    UNION ALL

    SELECT cid, csize, unnest(poi_ids) as poi_id, eps
    FROM (

        SELECT cid, count(*) as csize, array_agg(id) as poi_ids, ( SELECT eps/2.0 FROM clusterize LIMIT 1 )/2.0 as eps
        FROM (
            SELECT id, (SELECT max(cid) FROM clusterize) + ST_ClusterDBSCAN(geometry, eps := ( SELECT eps/2.0 FROM clusterize LIMIT 1), minpoints := 0) over () AS cid
            FROM clusterize
            WHERE csize > 5000
        ) clusters
        GROUP BY cid
    ) q
)
SELECT *

-- here filter out non-max cids for each poi_id
FROM clusterize limit 1000

However, it seems I am unable to refer to the recursive CTE inside a subquery, as Postgres complains with:

ERROR:  recursive reference to query "clusterize" must not appear within a subquery
LINE 15: ..., array_agg(id) as poi_ids, ( SELECT eps/2.0 FROM clusterize...

I would like to know if this can even be come with WITH RECURSIVE given the limitations I encountered above.

The reason I want to accomplish this within Postgres and not Python is that the number of points to cluster will continue increasing. The table already has about 1 million rows, and I would like to avoid loading all this data into Python.

Answer

No time for more improving or testing, but: for a single, more generic recursive term, and possibly better performance, try

WITH RECURSIVE
    params AS (                      -- convenience variables for testing parameters
        SELECT  10 AS max_size,      -- max. cluster size
                1 AS max_points,     -- 'max_points' parameter
                1 AS eps,            -- 'eps' distance parameter
                0.1 AS fraction      -- decreasing fraction of/to 'eps' parameter
    ),
    clst AS (
        SELECT  ARRAY[a._clst_id] AS _clst_ids,

                1 - (1 * (SELECT fraction FROM params)) AS _eps,
                ST_Collect(a.geom) AS geom
        FROM    (
            SELECT  id,
                    ST_SetSRID(ST_MakePointM(ST_X(geom), ST_Y(geom), id), 4326) AS geom,
                    ST_ClusterDBSCAN(geom, (SELECT eps FROM params), (SELECT max_points FROM params)) OVER() AS _clst_id
            FROM    
        ) AS a
        GROUP BY
                _clst_id

        UNION ALL
        SELECT  CASE WHEN ST_NumGeometries(b.geom) > (SELECT max_size FROM params)
                    THEN a._clst_ids || b._clst_id
                    ELSE NULL
                END AS _clst_ids,
                a._eps - (a._eps * (SELECT fraction FROM params)) AS _eps,
                b.geom AS geom
        FROM    clst AS a
        CROSS JOIN LATERAL (
            SELECT  ST_Collect(c.geom) AS geom,

                    c._clst_id
            FROM    (
                SELECT  dmp.geom,
                        ST_ClusterDBSCAN(dmp.geom, a._eps, (SELECT max_points FROM params)) OVER() AS _clst_id
                FROM    LATERAL ST_DumpPoints(a.geom) AS dmp
            ) c
            GROUP BY
                    c._clst_id
        ) b
        WHERE   ST_NumGeometries(a.geom) > (SELECT max_size FROM params)

    )
SELECT  ST_M(geom)::INT AS id,
        ST_Force2d(geom) AS geom,
FROM    (
    SELECT  ROW_NUMBER() OVER() AS clst_id,
            (ST_DumpPoints(geom)).geom
    FROM    clst
    WHERE   _clst_ids IS NULL
) q
;

This approach ST_Collects points based on their _clst_id and recursively processes those (each row in clst) with ST_NumGeometries > max_size using a LATERAL JOIN. If a cluster has reached max_size, it get's NULL as _clst_ids to mark it as a finished cluster.

I used params.fraction = 0.1 to decrease the eps distance, which is pretty intense; smaller values will yield more precise results, but increase the execution time (probably) exponentially.

Since geometry aggregation makes it a pain in the to keep attributes along the way, and a join on geometric equality with very large tables to retrieve the original attributes is costly, I write the id of each point into the M coordinate of the points and extract them later. This only works with numeric values.

If you are interested in MultiPoint geometries per cluster, just remove those parts and the dump in the final query.

It would probably be a better idea to write a function for this; I couldn't say if a DO ... WHILE loop would perform better than the WITH RECURSIVE implementation, but you could work with attributes a lot better (and probably more performant, especially if you are interested in other original attributes than the id).

Blog

Sunday, 8 December 2019

postgis - Applying ST_ClusterDBSCAN repeatedly until cluster sizes small enough

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?