Thursday 24 March 2016

python - OGR/GDAL threading results in low core utilization


I'm trying to process some raster data using OGR/GDAL, and I can't seem to get full utilization of all the cores on my machine. When I run the process on a single core, I get 100% utilization of that core. When I try to split the work across multiple cores (in the example below, by chunking the x offsets and putting them in a queue), I get pathetic utilization on each of my 8 cores. The total seems to add up to only 100% of a single core (e.g. 12.5% on each).


I was concerned that using the same datasource was the bottleneck, so I duplicated the underlying raster file for each core... and core utilization is still crap. This leads me to believe that OGR or GDAL is somehow behaving like a shared, bottlenecked resource, but I can't find anything online about that. Any help would be much appreciated!


This is the "helper" function that runs inside each Worker thread:


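from osgeo import ogr


def find_pixels_intersect_helper(datasource, bounds_wkt, x_min, x_max):
    # Region of interest, rebuilt from WKT inside each worker.
    bounds = ogr.CreateGeometryFromWkt(bounds_wkt)
    rows_to_write = []
    # Scan this worker's slice of columns, top to bottom.
    for x_offset in range(x_min, x_max):
        for y_offset in range(datasource.RasterYSize):
            # pix_to_wkt (defined elsewhere) returns the pixel's footprint as WKT.
            pxl_bounds_wkt = pix_to_wkt(datasource, x_offset, y_offset)
            pxl_bounds = ogr.CreateGeometryFromWkt(pxl_bounds_wkt)
            # Record the ID and centroid of every pixel that intersects the bounds.
            if pxl_bounds.Intersects(bounds):
                rows_to_write.append(['%s_%s' % (x_offset, y_offset), pxl_bounds.Centroid().ExportToWkt()])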
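The pix_to_wkt helper is not shown in the post. For readers who want to reproduce the setup, the following is only a guess at what such a helper might do: the name pixel_to_wkt and its signature are hypothetical, and it assumes the caller passes the 6-element tuple returned by the dataset's GetGeoTransform() rather than the datasource itself. It applies the standard GDAL affine geotransform to the four corners of a pixel.

```python
def pixel_to_wkt(geotransform, x_offset, y_offset):
    """Hypothetical stand-in for pix_to_wkt: WKT polygon covering one pixel.

    geotransform is assumed to be the 6-tuple from Dataset.GetGeoTransform():
    (origin_x, pixel_width, x_rotation, origin_y, y_rotation, pixel_height).
    """
    def to_geo(col, row):
        # Standard GDAL affine transform from pixel/line to georeferenced x/y.
        x = geotransform[0] + col * geotransform[1] + row * geotransform[2]
        y = geotransform[3] + col * geotransform[4] + row * geotransform[5]
        return x, y

    corners = [
        to_geo(x_offset, y_offset),          # upper left
        to_geo(x_offset + 1, y_offset),      # upper right
        to_geo(x_offset + 1, y_offset + 1),  # lower right
        to_geo(x_offset, y_offset + 1),      # lower left
    ]
    corners.append(corners[0])  # repeat the first corner to close the ring
    ring = ", ".join("%s %s" % (x, y) for x, y in corners)
    return "POLYGON ((%s))" % ring
```

Taking the geotransform as a plain tuple keeps the sketch independent of GDAL, so it can be tried without a raster file.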

Answer



OK. That was a day of my life that I'll never get back again. It turns out the problem was not in the code I posted above; that code is totally fine. This was a case of threading.Thread vs. multiprocessing.Process.



As pointed out in the Python documentation:



The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.



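Thus, threading.Thread is suited to I/O-bound operations, while multiprocessing.Process is suited to CPU-bound ones: in CPython, the Global Interpreter Lock lets only one thread execute Python bytecode at a time, so CPU-bound threads end up sharing a single core. I switched to multiprocessing.Process and everything works great.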
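As a rough illustration of the switch, here is a minimal sketch that splits the column range into chunks and hands each chunk to a separate process. It is not the code from the original project: it uses multiprocessing.Pool as a convenience instead of managing multiprocessing.Process objects by hand, and split_columns, count_hits, and count_hits_parallel are hypothetical names. The arithmetic inside count_hits is only a CPU-bound placeholder for the per-pixel intersection test, so the sketch runs without GDAL.

```python
import multiprocessing


def split_columns(width, n_chunks):
    """Split range(width) into at most n_chunks contiguous (x_min, x_max) pairs."""
    step = max(1, -(-width // n_chunks))  # ceiling division, at least 1
    return [(x, min(x + step, width)) for x in range(0, width, step)]


def count_hits(task):
    """CPU-bound placeholder for the per-pixel work on one chunk of columns."""
    x_min, x_max, height = task
    hits = 0
    for x in range(x_min, x_max):
        for y in range(height):
            # Stand-in for the real geometry test (an assumption for illustration).
            if (x * x + y * y) % 7 == 0:
                hits += 1
    return hits


def count_hits_parallel(width, height, n_procs):
    """Run count_hits on each chunk in its own process and combine the results."""
    tasks = [(x_min, x_max, height) for x_min, x_max in split_columns(width, n_procs)]
    with multiprocessing.Pool(processes=n_procs) as pool:
        return sum(pool.map(count_hits, tasks))


if __name__ == "__main__":
    # The guard keeps child processes from re-running this block on import.
    print(count_hits_parallel(1000, 1000, multiprocessing.cpu_count()))
```

Each worker process has its own interpreter and its own GIL, which is why every core can be kept busy. Arguments and results are pickled between processes, so with real GDAL data it is probably safer to pass a file path and have each worker open its own dataset than to pass an open datasource object.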


Check out this tutorial to learn how to use multiprocessing.Process

