Taking the union* of these two large shapefiles file1 file2 locally using ArcMap takes appr. 10 min.
Over the last few weeks, I've tried to obtain the same results using other, open-source approaches without success. I am curious why only ArcMap seems to generate results in a reasonable amount of time without errors.
The approaches I've tried so far:
The most straight-forward alternative. There are two "union" functions available in QGIS. One default (part of vector overlay), one based on SAGA. Both functions create the desired results on smaller samples but fail to run (or take forever) on my shapefiles. I tried running union for 30 minutes after which the progress bar was at appr. 4%
I've ingested the shapefiles on an AWS RDS instance running postGreSQL with PostGIS enabled. The approach is not so straight forward and requires a hideous long SQL query to obtain the same result. Most importantly, the query did not create a meaningful result on the full datasets. It did run properly on a sample though.
See separate question here
CREATE TABLE test.sqlislelijk AS
-- input data
with polys1 AS (
SELECT
pfaf_id as df1,
geom as g
FROM hybas06_v04
),
polys2 AS (
SELECT
aqid as df2,
geom as g
FROM y2018m11d14_rh_whymap_to_rds_v01_v01
),
-- intersections
intersections AS (
SELECT df1, df2, ST_INTERSECTION(a.g, b.g) i, a.g AS g1, b.g AS g2
FROM polys1 a, polys2 b WHERE ST_INTERSECTS(a.g, b.g)
),
-- per-row union of intersections with this row
diff1 AS (
SELECT df1, ST_UNION(i) i FROM intersections GROUP BY df1
),
diff2 AS (
SELECT df2, ST_UNION(i) i FROM intersections GROUP BY df2
),
-- various combinations of intersections
pairs AS (
SELECT df1, df2, i AS g FROM intersections
UNION ALL
SELECT
p.df1,
NULL,
CASE
WHEN i IS NULL THEN g
ELSE ST_DIFFERENCE(g, i)
END
FROM polys1 p LEFT JOIN diff1 d ON p.df1 = d.df1
UNION ALL
SELECT
NULL,
p.df2,
CASE
WHEN i IS NULL THEN g
ELSE ST_DIFFERENCE(g, i)
END
FROM polys2 p LEFT JOIN diff2 d ON p.df2 = d.df2
)
SELECT
df1 as pfaf_id,
df2 as aqid,
g as geom
FROM pairs WHERE NOT ST_IsEmpty(g);
Google's GIS extension for BQ looks promising but I experienced weird behavior and internal errors, probably due to mixing of GEOMETRY and GEOGRAPHY objects. Also, the syntax is similar to PostGIS and a command of 1 line in geopandas takes >20 lines in SQL.
Probably my favorite approach. The out-of the box approach does't give any results after running it for several hours. My latest approach includes tiling the shapefiles, use multiprocessing on each tile and merge them. Not very straight forward and one 10x10 degree tile till takes 40+ minutes to calculate.
See Overlay Union Geopandas improve performance
It is the first time in my career that I am impressed by the performance of ArcMap. It seems that ArcMap handles this use-case much, much better than anything else. The issue seems persistent over all solutions I've tried so far. Is there something wrong in the underlying libraries of the open-source/alternative solutions?
- Note that Union in ArcMap means something different than in the Shapely / PostGIS world.
No comments:
Post a Comment