Saturday, 9 July 2016

clip - Parallelising GIS operations in PyQGIS?


A common requirement in GIS is to apply a processing tool to a number of files or apply a process for a number of features in one file to another file.


Much of these operations are embarrassingly parallel in that the results of the calculations in no way influence any other operation in the loop. Not only that but often the input files are each distinct.


A classic case in point is the tiling out of shape files against files containing polygons to clip them against.



Here is a (tested) classical procedural method to achieve this in a python script for QGIS. (fyi the output of temporary memory files to real files more than halved the time to process my test files)


import processing
import os
input_file="/path/to/input_file.shp"
clip_polygons_file="/path/to/polygon_file.shp"
output_folder="/tmp/test/"
input_layer = QgsVectorLayer(input_file, "input file", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(input_layer)
tile_layer = QgsVectorLayer(clip_polygons_file, "clip_polys", "ogr")
QgsMapLayerRegistry.instance().addMapLayer(tile_layer)

tile_layer_dp=input_layer.dataProvider()
EPSG_code=int(tile_layer_dp.crs().authid().split(":")[1])
tile_no=0
clipping_polygons = tile_layer.getFeatures()
for clipping_polygon in clipping_polygons:
print "Tile no: "+str(tile_no)
tile_no+=1
geom = clipping_polygon.geometry()
clip_layer=QgsVectorLayer("Polygon?crs=epsg:"+str(EPSG_code)+\
"&field=id:integer&index=yes","clip_polygon", "memory")

clip_layer_dp = clip_layer.dataProvider()
clip_layer.startEditing()
clip_layer_feature = QgsFeature()
clip_layer_feature.setGeometry(geom)
(res, outFeats) = clip_layer_dp.addFeatures([clip_layer_feature])
clip_layer.commitChanges()
clip_file = os.path.join(output_folder,"tile_"+str(tile_no)+".shp")
write_error = QgsVectorFileWriter.writeAsVectorFormat(clip_layer, \
clip_file, "system", \
QgsCoordinateReferenceSystem(EPSG_code), "ESRI Shapefile")

QgsMapLayerRegistry.instance().addMapLayer(clip_layer)
output_file = os.path.join(output_folder,str(tile_no)+".shp")
processing.runalg("qgis:clip", input_file, clip_file, output_file)
QgsMapLayerRegistry.instance().removeMapLayer(clip_layer.id())

This would be fine except that my input file is 2GB and the polygon clipping file contains 400+ polygons. The resulting process takes over a week on my quad core machine. All the while three cores are just idling.


The solution I have in my head is to export the process out to script files and run them asynchronously using gnu parallel for example. However it seems a shame to have to drop out of QGIS into an OS specific solution rather than use something native to QGIS python. So my question is:


Can I parallelise embarrassingly parallel geographic operations natively inside python QGIS?


If not, then perhaps someone already has the code to send this sort of work off to asynchronous shell scripts?



Answer




Here is the gnu parallel solution. With some care most emabrrassingly parallel linux based ogr or saga algorithms could be made to run with it inside your QGIS installation.


Obviously this solution requires the installation of gnu parallel. To install gnu parallel in Ubuntu, for example, go to your terminal and type


sudo apt-get -y install parallel

NB: I couldn't get the parallel shell command to work in Popen or subprocess, which I would have preferred, so I hacked together an export to a bash script and ran that with Popen instead.


Here is the specific shell command using parallel that I wrapped in python


parallel ogr2ogr -skipfailures -clipsrc tile_{1}.shp output_{1}.shp input.shp ::: {1..400}

Each {1} gets swapped out for a number from the {1..400} range and then the four hundred shell commands get managed by gnu parallel to concurrently use all the cores of my i7 :).


Here is the actual python code I wrote to solve the example problem I posted. One could paste it in directly after the end of the code in the question.



import stat
from subprocess import Popen
from subprocess import PIPE
feature_count=tile_layer.dataProvider().featureCount()
subprocess_args=["parallel", \
"ogr2ogr","-skipfailures","-clipsrc",\
os.path.join(output_folder,"tile_"+"{1}"+".shp"),\
os.path.join(output_folder,"output_"+"{1}"+".shp"),\
input_file,\
" ::: ","{1.."+str(feature_count)+"}"]

#Hacky part where I write the shell command to a script file
temp_script=os.path.join(output_folder,"parallelclip.sh")
f = open(temp_script,'w')
f.write("#!/bin/bash\n")
f.write(" ".join(subprocess_args)+'\n')
f.close()
st = os.stat(temp_script)
os.chmod(temp_script, st.st_mode | stat.S_IEXEC)
#End of hacky bash script export
p = Popen([os.path.join(output_folder,"parallelclip.sh")],\

stdin=PIPE, stdout=PIPE, stderr=PIPE)
#Below is the commented out Popen line I couldn't get to work
#p = Popen(subprocess_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
output, err = p.communicate(b"input data that is passed to subprocess' stdin")
rc = p.returncode
print output
print err

#Delete script and old clip files
os.remove(os.path.join(output_folder,"parallelclip.sh"))

for i in range(feature_count):
delete_file = os.path.join(output_folder,"tile_"+str(i+1)+".shp")
nosuff=os.path.splitext(delete_file)[0]
suffix_list=[]
suffix_list.append('.shx')
suffix_list.append('.dbf')
suffix_list.append('.qpj')
suffix_list.append('.prj')
suffix_list.append('.shp')
suffix_list.append('.cpg')

for suffix in suffix_list:
try:
os.remove(nosuff+suffix)
except:
pass

Let me tell you it's really something when you see all the cores fire up to full noise :). Special thanks to Ole and the team that built Gnu Parallel.


It would be nice to have a cross platform solution and it would be nice if I could have figured out the multiprocessing python module for the qgis embedded python but alas it was not to be.


Regardless this solution will serve me and maybe you nicely.


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...