I'm having real problems cleaning up and reducing the size of a Shapefile which is published by the UK's Environment Agency.
It shows the extents of the LiDAR open data they publish: each polygon is a survey flight, and there are fields for the resolution, the date of the flight and so on.
Eventually I'd like to end up with a dissolved (merging all features) and simplified Shapefile which is much smaller in size. At the moment:
- The .shp is 360.6MB
- The .dbf is 26.9MB
- There are 121,753 polygons
I think one of the reasons the file is so large is that there are many small 'specks' of data, which aren't important for my purposes.
What I've tried so far:
- Dissolving with QGIS: this seemed to make no progress on such a big file so I cancelled it after a while
- Dissolving with OGR (ogr2ogr dissolved.shp LIDAR_composite_extents_2015.shp -dialect sqlite -sql "SELECT ST_Union(geometry) AS geometry FROM LIDAR_composite_extents_2015"): after a few hours I would get an error like this: GEOS error: TopologyException: Input geom 0 is invalid: Ring Self-intersection at or near point 221912.50000000093 50580.000000001863 at 221912.50000000093 50580.000000001863 – I guess this will just be the first of many errors
- Cleaning the Shapefile, first with QGIS's Check Geometry Validity, then with GRASS's v.clean (I tried bpol), but the cleaned file still fails to dissolve (I also tried adding a zero buffer)
- Converting the multipart polygons to singleparts, adding a geometry column and removing features smaller than a certain area. This took about 100MB off the filesize, but didn't affect the 'holes' within polygons – to get rid of those I tried to make a difference layer, but the difference operation consistently fails (if I check ignore invalid geometry, it creates lots of features which are visible in the attribute table, but doesn't display them).
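The area-based filtering step above can also be done with ogr2ogr in one pass. This is a sketch only: singleparts.shp is a hypothetical name for the singlepart layer produced earlier, the 10000 square-unit threshold is an arbitrary example value, and it assumes OGR was built with SpatiaLite so that ST_Area is available in the SQLite dialect:

```shell
# Drop singlepart 'speck' polygons below a chosen area threshold before
# attempting the dissolve. The input name and threshold are placeholders.
ogr2ogr filtered.shp singleparts.shp -dialect sqlite \
  -sql "SELECT * FROM singleparts WHERE ST_Area(geometry) > 10000"
```

Filtering before dissolving keeps ST_Union from spending time unioning thousands of tiny features that will be discarded anyway.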
I will simplify the file in the end, but I don't want to do this before dissolving in case I introduce slivers.
Should I use a different tool within v.clean?
I have not yet tried to split the file into regions and dissolve those, then dissolve the regions.
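The tiled approach mentioned above could be sketched as follows. This is hypothetical and untested against the actual data: the tile extents are placeholder British National Grid coordinates that would need adjusting to the layer's real extent, and it assumes OGR with SpatiaLite support for ST_Union:

```shell
# Sketch: dissolve the data in four bounding-box tiles, then merge the
# tiles and dissolve once more. Extents below are placeholders.
for i in 0 1 2 3; do
  xmin=$((i % 2 * 350000)); ymin=$((i / 2 * 650000))
  # -spat clips the read to a bounding box
  ogr2ogr -spat $xmin $ymin $((xmin + 350000)) $((ymin + 650000)) \
    tile_$i.shp LIDAR_composite_extents_2015.shp
  ogr2ogr tile_${i}_dissolved.shp tile_$i.shp -dialect sqlite \
    -sql "SELECT ST_Union(geometry) AS geometry FROM tile_$i"
done
# Append the dissolved tiles into one layer, then do a final dissolve.
ogr2ogr merged_tiles.shp tile_0_dissolved.shp
for i in 1 2 3; do
  ogr2ogr -update -append merged_tiles.shp tile_${i}_dissolved.shp
done
ogr2ogr final.shp merged_tiles.shp -dialect sqlite \
  -sql "SELECT ST_Union(geometry) AS geometry FROM merged_tiles"
```

Each per-tile union works on far fewer polygons, so an invalid-geometry failure is also localized to one tile rather than aborting the whole job.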
Answer
The source data seems to be rather hard to handle as vectors, as you have noticed. However, this workaround, which goes through an intermediate raster file, works well and is very fast.
1) Use gdal_rasterize http://www.gdal.org/gdal_rasterize.html for making a black/white raster from the layer
gdal_rasterize -burn 100 -ts 1000 1000 -ot Byte LIDAR_composite_extents_2015.shp burned.tif
2) Use gdal_polygonize http://www.gdal.org/gdal_polygonize.html for creating polygons from the raster
gdal_polygonize.py -f "ESRI Shapefile" burned.tif merged.shp
Select all the polygons which have "0" as the value of the "DN" attribute and delete them. As a result you will have a small 2 MB shapefile describing the lidar coverage.
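The deletion of the background polygons can also be done non-interactively with ogr2ogr's attribute filter. A sketch, assuming the DN field name produced by gdal_polygonize's default output:

```shell
# Keep only polygons covering surveyed areas: DN = 100 is the value burned
# in step 1, while DN = 0 polygons are the empty background.
ogr2ogr coverage.shp merged.shp -where "DN != 0"
```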
Because of the rasterize-polygonize cycle, some precision of the original vectors is lost. Increase the output raster size if you need better resolution. Depending on your use case, you could perhaps use the raster image as an index without going back to vectors.