Thursday, 12 May 2016

arcpy - Streamlining Python Code for Big Data


I have Python code which is designed to take point shapefiles through the following workflow:



  1. Merge points

  2. Integrate points, so that any points within 1 m of each other become one point

  3. Create a feature layer in which points with z < 10 are selected

  4. Buffer the points

  5. Convert polygons to a raster at 1 m resolution

  6. Reclassify, so that values 1-9 become 1 and NoData becomes 0



Each shapefile has approximately 250,000 to 350,000 points covering an area of roughly 5 x 7 km. The input points represent tree locations; each point (i.e. tree) has an associated "z" value representing crown radius, which is used in the buffer step. My intent is to use the final binary output in a separate process to produce a raster describing canopy cover.


I ran a test with four shapefiles; it produced a 700 MB raster and took 35 minutes (i5 processor and 8 GB RAM). Since I will need to run this process on 3,500 shapefiles, I would appreciate any advice for streamlining it (see attached code). Generally speaking, what is the best way to deal with geoprocessing big data? More specifically, are there any tweaks to the code or workflow that could increase efficiency?


Edit:


Time (% of total) for geoprocessing tasks:



  • Merge = 7.6%

  • Integrate = 7.1%

  • Feature to Layer = 0%

  • Buffer = 8.8%

  • Polygon to Raster = 74.8%

  • Reclassify = 1.6%




# Import arcpy module
import arcpy

# Check out any necessary licenses
arcpy.CheckOutExtension("spatial")

# Script arguments

temp4 = arcpy.GetParameterAsText(0)
if temp4 == '#' or not temp4:
    temp4 = "C:\\gdrive\\temp\\temp4"  # provide a default value if unspecified

Reclassification = arcpy.GetParameterAsText(1)
if Reclassification == '#' or not Reclassification:
    Reclassification = "1 9 1;NODATA 0"  # provide a default value if unspecified

Multiple_Value = arcpy.GetParameterAsText(2)
if Multiple_Value == '#' or not Multiple_Value:
    Multiple_Value = "C:\\t1.shp;C:\\t2.shp;C:\\t3.shp;C:\\t4.shp"  # provide a default value if unspecified

# Local variables (distinct intermediate outputs; exact paths are illustrative
# except temp.shp, which matches the Integrate call below):
temp_shp = "C:\\gdrive\\temp\\temp.shp"            # merged points
temp2_Layer = "temp2_Layer"                        # in-memory feature layer name
temp_Buffer = "C:\\gdrive\\temp\\temp_buffer.shp"  # buffered polygons
temp3 = "C:\\gdrive\\temp\\temp3"                  # intermediate raster

# Process: Merge
# (the long field-mappings argument is omitted here to save space; it is optional)
arcpy.Merge_management(Multiple_Value, temp_shp)

# Process: Integrate (modifies the merged feature class in place)
arcpy.Integrate_management(temp_shp, "1 Meters")

# Process: Make Feature Layer
arcpy.MakeFeatureLayer_management(temp_shp, temp2_Layer, "z <10", "", "x x VISIBLE NONE;y y VISIBLE NONE;z z VISIBLE NONE;Buffer Buffer VISIBLE NONE")

# Process: Buffer
arcpy.Buffer_analysis(temp2_Layer, temp_Buffer, "z", "FULL", "ROUND", "NONE", "")


# Process: Polygon to Raster
arcpy.PolygonToRaster_conversion(temp_Buffer, "BUFF_DIST", temp3, "CELL_CENTER", "NONE", "1")

# Process: Reclassify
arcpy.gp.Reclassify_sa(temp3, "Value", Reclassification, temp4, "DATA")

Answer



Here are some algorithmic changes that should help.


Execute your selection (z < 10) before the merge and integrate. That cuts down significantly on the input to the later functions, which are the most expensive.
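
For example, a minimal sketch of that pre-filtering step, writing each selection to an in_memory feature class (the input list and names are illustrative):

import arcpy

# Hypothetical input shapefiles; substitute your own list.
shapefiles = ["C:\\t1.shp", "C:\\t2.shp", "C:\\t3.shp", "C:\\t4.shp"]

filtered = []
for i, shp in enumerate(shapefiles):
    out_fc = "in_memory\\filtered_{0}".format(i)
    arcpy.Select_analysis(shp, out_fc, '"z" < 10')  # keep only points with z < 10
    filtered.append(out_fc)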


Merge and Integrate are both memory-expensive, so keep eliminating features as you bring in feature classes, and arrange your merges as a binary tree to keep the size of each merge and integrate down. E.g. for four shapefiles: merge two shapefiles and integrate; merge the other two shapefiles and integrate; then merge the two resulting feature classes and integrate.
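
A minimal sketch of those pairwise steps, continuing from the filtered list above (the helper is illustrative):

def merge_and_integrate(fc_a, fc_b, out_fc):
    # Merge a pair of feature classes, then snap points within 1 m together.
    arcpy.Merge_management([fc_a, fc_b], out_fc)
    arcpy.Integrate_management(out_fc, "1 Meters")
    # The two inputs are no longer needed once merged.
    arcpy.Delete_management(fc_a)
    arcpy.Delete_management(fc_b)
    return out_fc

# Four inputs reduce as a binary tree: (1+2), (3+4), then the two results.
ab = merge_and_integrate(filtered[0], filtered[1], "in_memory\\merged_ab")
cd = merge_and_integrate(filtered[2], filtered[3], "in_memory\\merged_cd")
final_fc = merge_and_integrate(ab, cd, "in_memory\\merged_all")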



Your job queue starts as a queue of shapefile references, and you also have a result queue to place results into. The run() method of each parallel-processing worker does the following: take two items off the job queue. If no item is taken (the queue is empty), terminate the worker. If only one item is taken, put that item straight into the result queue.


If two items are taken, then for each item: if it is a shapefile, select for z < 10 and create an in_memory feature class; otherwise it is already an in_memory feature class, so skip the selection step. Merge the two in_memory feature classes to create a new in_memory feature class, and delete the original two. Execute Integrate on the new feature class, then place it into the result queue.
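
Roughly, such a worker might be sketched with Python's multiprocessing module. One caveat: in_memory workspaces are per-process, so this sketch writes intermediates to a shared scratch file geodatabase instead, so that paths remain valid across workers; all names are illustrative:

import arcpy
try:
    from queue import Empty           # Python 3
except ImportError:
    from Queue import Empty           # Python 2 (ArcGIS Desktop)

def run(job_queue, result_queue, scratch_gdb):
    while True:
        # Take up to two items off the job queue.
        try:
            first = job_queue.get_nowait()
        except Empty:
            return                        # queue empty: terminate the worker
        try:
            second = job_queue.get_nowait()
        except Empty:
            result_queue.put(first)       # lone item passes straight through
            return
        pair = []
        for item in (first, second):
            if item.lower().endswith(".shp"):
                # Raw shapefile: apply the z < 10 selection first.
                fc = arcpy.CreateUniqueName("filtered", scratch_gdb)
                arcpy.Select_analysis(item, fc, '"z" < 10')
                pair.append(fc)
            else:
                pair.append(item)         # already a filtered intermediate
        merged = arcpy.CreateUniqueName("merged", scratch_gdb)
        arcpy.Merge_management(pair, merged)
        for fc in pair:                   # delete the original two
            arcpy.Delete_management(fc)
        arcpy.Integrate_management(merged, "1 Meters")
        result_queue.put(merged)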


Then run an outer while loop. The loop starts with the shapefile queue and tests for a length greater than 1, then runs the queue through the workers. If the result queue still has a length greater than 1, the while loop executes another parallel pass through the workers, and so on until the result queue holds a single feature class.
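
A sketch of that outer loop, assuming the run() worker above is defined at module top level (multiprocessing on Windows requires this) and an arbitrary worker count:

import multiprocessing

def reduce_all(shapefiles, scratch_gdb, n_workers=4):
    jobs = multiprocessing.Queue()
    for shp in shapefiles:
        jobs.put(shp)
    remaining = len(shapefiles)
    while remaining > 1:
        results = multiprocessing.Queue()
        workers = [multiprocessing.Process(target=run, args=(jobs, results, scratch_gdb))
                   for _ in range(n_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        jobs = results                    # this pass's results feed the next pass
        remaining = (remaining + 1) // 2  # each pass roughly halves the job count
    return jobs.get()                     # the single surviving feature class

On Windows, wrap the call to reduce_all() in the usual if __name__ == '__main__': guard.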


E.g. if you start with 3500 shapefiles, your first pass will have 3500 jobs; the second, 1750; then 875, 438, 219, 110, 55, 28, 14, 7, 4, 2, 1. Your big bottleneck will be memory. If you do not have enough (and if so, you will run out while building the first result queue), modify the algorithm to merge more than two feature classes at once before integrating, which cuts down the size of the first result queue in exchange for longer processing time. Optionally, you could write output files and skip in_memory feature classes entirely. That slows you down considerably, but gets past the memory bottleneck.


Only after you have performed merge and integrate on all of the shapefiles, ending with one single feature class, should you perform the buffer, polygon to raster, and reclassify. That way those three operations are performed only once, and you keep your geometry simple.
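
Those last three steps then mirror the original script, run once on the single surviving feature class (output paths are placeholders):

# The scratch geodatabase must already exist
# (e.g. created with arcpy.CreateFileGDB_management).
final_fc = reduce_all(shapefiles, "C:\\gdrive\\temp\\scratch.gdb")

# Buffer each tree point by its crown radius (the z field).
arcpy.Buffer_analysis(final_fc, "in_memory\\buffered", "z", "FULL", "ROUND", "NONE", "")

# Rasterise the buffers at 1 m resolution.
arcpy.PolygonToRaster_conversion("in_memory\\buffered", "BUFF_DIST",
                                 "C:\\gdrive\\temp\\temp3", "CELL_CENTER", "NONE", "1")

# Collapse to a binary canopy / no-canopy raster.
arcpy.CheckOutExtension("spatial")
arcpy.gp.Reclassify_sa("C:\\gdrive\\temp\\temp3", "Value", "1 9 1;NODATA 0",
                       "C:\\gdrive\\temp\\temp4", "DATA")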

