Monday, 11 February 2019

memory issue when trying to buffer/union large dataset using postgis


I am trying to generated a dissolved buffer of a large point dataset (Ideally 29 million points - Address data in Great-Britain, but I receive the data by chunks of 1 million points, so these could be used instead).


After trying (and obviously failing!) different methods using ArcGIS, I came across this article: http://blog.cleverelephant.ca/2009/01/must-faster-unions-in-postgis-14.html


Since I have been willing to try postgis for a while a gave this a go. Obviously being an absolute beginner I struggled a bit to get the whole thing up and running, but I finally managed to import data and buffer it. I am struggling however with the ST_Union call, because of memory issues (I am using postgis 2.0.1 running on pretty old 32bits windows pc with little scope to change that in the near future).


I tried changing the memory settings of postgres, as described in the question below: Queries returning very big datasets in PostGIS


I believe I also created a spatial index properly on my geometry column. As far as I can tell, it didn't help.


I managed to run ST_Union on a 500 000 points subset, which is already pretty big. I think at that point the best for me might be to try using the Declare/Fetch method mentioned in the above question, however being a complete beginner I am having difficulties.


I believe it is in my interest to keep the slices as big as possible (because the ST_Union algorithm is fast and optimised). I think I would therefore just need to find a way to "slice" my ST_Union() queries to output one multipolygon for each 500000 rows. I would then be just one ST_Union on the resulting multipolygons away from the holy grail...


It is also possible that my query is sub-optimal. Or, there might by another great simple solution that I just haven't thought about or found online yet...



Here is a query I am currently using:


    drop table if exists ab_buffer_unioned;
create table ab_buffer_unioned (
gid serial primary key
, the_geom geometry
);

insert into
ab_buffer_unioned (the_geom)
select

st_Union(the_geom)
from
(select the_geom from ab_buffer limit 500000) as absubset;

Does anybody have any idea on how to fix my issues, or modify that query to add a loop or a cursor?


Many thanks


UPDATE


Many thanks to Paul for his suggestion below. The GeoHash trick worked great as can be seen below (see how the cluster size varies with the population density!) ST_Union applied to slices of data sorted by GeoHash



Answer



You seem to be trying to union all the shapes without any use of attributes to group, which is odd, but taking that as a given... you want to try and union things that are close together first. So



CREATE SEQUENCE bseq;

WITH ordered AS (
SELECT ST_Buffer(geom, 10) AS geom
FROM points
ORDER BY ST_GeoHash(geom)
),
grouped AS (
SELECT nextval('bseq') / 100000 AS id, ST_Union(geom) AS geom
FROM ordered

GROUP BY id
)
groupedfinal AS (
SELECT ST_Union(geom) AS geom
FROM grouped
)
SELECT * FROM groupedfinal;

You might still run out of memory since you're building one ultra huge geometry from millions of inputs. If you do actually have grouping that results in a smaller feature I'm sure things could be finessed.


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...