How to speed up the "union" of two shapefiles using geopandas?
I have two large shapefiles:
geodataframe_left.shape = (3610, 12) (500 MB) (link)
geodataframe_right.shape = (16396, 3) (200 MB) (link)
One file contains global province boundaries (GADM), The other shapefile contains Hydrobasin level 6. I've tried multiple approaches (PostGIS, Google Bigquery, ArcGIS) but nothing beats geopandas' one line command:
gdf_union = gpd.overlay(gdf_left, gdf_right, how='union')
there is however one problem. The command is extremely slow on larger datasets including mine. I've waited for almost an hour now without success. I'm running my code on a large amazon EC2 instance (32GB RAM).
How can I improve the performance of this operation? Is it possible to use multiprocessing for example or can I split the problem into smaller chunks so I can monitor progress?
UPDATE:
- I tried updating to geopandas 0.4.0 .. still no result after running for two hours.
- Used the use_sindex = True flag. Deprecated since 0.4.0 (always uses sindex)
- I implemented the tile approach and found weird behavior. Details below.
I created a fishnet grid of polygons of 10 x 10 degrees that I use to clip both layers before performing the union. The processing times per cell differ wildly and there are 4 cells that take > 10 minutes to process. One cell takes even 40 minutes.
When the process runs in parallel I get the following error (script continues though):
AttributeError: 'IndexStreamHandle' object has no attribute '_ptr'
Exception ignored in: >
Traceback (most recent call last):
File "/opt/anaconda3/envs/python35/lib/python3.5/site-packages/rtree/index.py", line 875, in __del__
self.destroy()
File "/opt/anaconda3/envs/python35/lib/python3.5/site-packages/rtree/index.py", line 863, in destroy
if self._ptr is not None:
What's interesting is that these cells are at the 180 meridian. I've uploaded the GPKG here
Maybe shapely doesn't like this hemisphere crossing stuff.
UPDATE 2:
Running ArcMap locally, the process takes appr. 10 minutes. Instead of using geopackages, I used shapefiles. file1 file2
Executing: Union "hybas_lev06_v1c_merged_fiona_V04 #;gadm36_1 #" C:\Users\Rutger.Hofste\Desktop\werkmap\union_benchmark\output\union_arcgis_global.shp ALL # GAPS
Start Time: Fri Nov 30 10:32:04 2018
Reading Features...
Processing Tiles...
Assembling Tile Features...
Succeeded at Fri Nov 30 10:43:02 2018 (Elapsed Time: 10 minutes 57 seconds)
I also tried "Union" (both default and SAGA) in QGIS but this process is also ridiculously slow. I ran for 30min and the progress bar was at 4%.
UPDATE 3: I'm implementing the tiled approach however for a small number of polygons the result of an intersection in shapely is not a multipologon or polygon but a geometrycollection with multiple geometries including lineStrings or just LineStrings. I would expect the result of an intersection to be polygon or multipolygon.
No comments:
Post a Comment