I am batch processing a large number of distance calculations (~50,000) using ArcGIS 9.3. In this version it seems that, when using Spatial Analyst functions, users cannot parallelize processes across cores or write to in_memory instead of the hard disk for intermediate steps, and there is a memory leak inherent in the program that causes processing to slow down over multiple loops. I've seen a forum post showing a crude workaround for the memory leak that just turns off garbage collection, but that alone probably won't help much. Right now I'm looking at processing times on the order of a few weeks on the computer I'm using for this.
Does anyone know of any reliable workarounds that would allow me to write SA output to in_memory, parallelize processing, or otherwise significantly speed things up? I'm open to using ERDAS if I have to.
Script:
# Import system modules
import sys, string, os, arcgisscripting, time, shutil
# Create the Geoprocessor object
gp = arcgisscripting.create(9.3)
# Allow overwrites
gp.overwriteoutput = 1
# ----------------------------------------------------------
# Set changeable model parameters
# ----------------------------------------------------------
cell_sz = 90.04245658
Max_oid = 1 # Should be 54567 for irrigated agriculture, in province sources. range() starts at 0 and stops one before its argument, so this should equal the number of records in the shapefile
Win_sz = 3100 # This is the minimum possible size, and should ensure that even the largest k yields values of 0.001 or less at the boundary
# Create a folder for this set of runs, designated by the current time, then copy the script being run to the folder and time stamp it
t = time.localtime()
timeholder = str(t[0]) + '_' + str(t[1]) + '_' + str(t[2]) + '_' + str(t[3]) + str(t[4]) + '_' + str(t[5])
folder_time_name = "M:\\Jess_GIS_NIDR\\" + timeholder
os.mkdir(folder_time_name)
os.mkdir(folder_time_name + "\\Dists") # The Dists subfolder must exist before RasterToPoint writes Dists_fin below
filename=sys._getframe().f_code.co_filename
shutil.copyfile(filename, folder_time_name + '\\this_script_executed_' + timeholder + '.py') #shutil
# Check out any necessary licenses
gp.CheckOutExtension("spatial")
cell_sz_str = str(cell_sz) + " METERS"
gp.cellSize = cell_sz
# Set up loop to go through the data set, iteratively select out cells, and process distance calculations
for i in range(Max_oid):
    pt = i
    print "Processing point number " + str(pt)
    # Load prepped dataset containing this point
    irragg = "M:\\Jess_GIS_NIDR\\Pts_irragg\\pt_int_" + str(pt) + ".shp"
    # Publish output to common folders - may add additional dists_proc style folders
    Dists_raw = "in_memory\\dists_raw"
    # Define local variables (files only associated with processing this source cell)
    DEM = "M:\\Jess_GIS_NIDR\\InputRasters_Globcover_Hydrosheds\\HydroSHEDS\\dem"
    Dir_hydro = "M:\\Jess_GIS_NIDR\\InputRasters_Globcover_Hydrosheds\\HydroSHEDS\\drain_dir"
    Dists_fin = folder_time_name + "\\Dists\\Dists_fin.shp"
    # Identify and set processing extent
    y = []
    x = []
    cur = gp.SearchCursor(irragg)
    row = cur.Next()
    while row:
        extent = row.Shape.Extent
        y.append(extent.ymin)
        x.append(extent.xmin)
        row = cur.Next()
    del row
    del cur
    ypt = y.pop(0)
    xpt = x.pop(0)
    ymin = ypt - Win_sz
    ymax = ypt + Win_sz
    xmin = xpt - Win_sz
    xmax = xpt + Win_sz
    del y
    del x
    gp.extent = "%f %f %f %f" % (xmin, ymin, xmax, ymax)
    # Path-distance function
    gp.PathDistance_sa(irragg, Dists_raw, "", DEM, Dir_hydro, "FORWARD 0.01 0.5", "", "", "", "")
    # Convert distance results to a permanent shapefile
    gp.RasterToPoint_conversion(Dists_raw, Dists_fin, "VALUE")
    # Further processing of the original distance raster to get at first-order ideas of distance
    # ---------------------------------------------------------------------------------------
    # Crudely correct for the fact that the land-use data was originally at a coarser resolution than this dataset
    gp.AddField_management(Dists_fin, "Dcorr", "DOUBLE", "", "", "", "", "NULLABLE", "", "")
    gp.CalculateField_management(Dists_fin, "Dcorr", "[RASTERVALU] - 90.04245658", "", "")
    # First-order distance function
    gp.AddField_management(Dists_fin, "Dist_k", "DOUBLE", "", "", "", "", "NULLABLE", "", "")
    gp.CalculateField_management(Dists_fin, "Dist_k", "Exp([Dcorr]*-0.00231)", "", "")
Answer
A couple of general tricks I have found useful in the past in this situation:
- Run your Python script as stand-alone (e.g. from IDLE, PyWin, Eclipse or preferably CMD) to remove the overhead of ArcMap.
- Spawning subprocesses is an old trick for working around ArcGIS memory leaks, even if you don't want to parallelize a process. It works because the memory is released each time the subprocess closes, whereas in your current code the memory is never correctly released for as long as your Python process stays open. You can also 'parallelize' processing by spawning subprocesses in this way (see the sketch just below and the discussion that follows). You CAN use multiple cores in ArcGIS 9.x in this way (I've done it on a 16-core machine and used all 16 cores for a single script).
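For illustration, here is a minimal sketch of the plain version of that trick. It assumes you have moved the body of your per-point loop into a hypothetical worker script (process_point.py here) that reads the point ID from sys.argv; the parent just launches one child interpreter per point, and all geoprocessor memory dies with each child:

# parent sketch: spawn one child process per point so ArcGIS memory
# is released when each child exits (process_point.py is hypothetical -
# it would hold the per-point body of the loop in the script above)
import subprocess, sys

Max_oid = 54567
for pt in range(Max_oid):
    # subprocess.call blocks until the child finishes, so points run
    # one at a time; parallel variants are discussed below
    subprocess.call([sys.executable, "process_point.py", str(pt)])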
You can get more sophisticated control of your subprocesses easily by using concurrent.futures if you can use Python 3.x (you can upgrade the version of Python used by Arc, but given the differences between Python 3.x and 2.x that is probably not an option for you), or you can use the threading module in older versions. (Before anybody complains that ArcGIS 9.x is not multithreaded, remember that it is not ArcGIS you are multithreading but Python.)
However, you do NOT want to multithread the actual geoprocessing task itself, as ironically this will not speed things up. Python has the dreaded GIL (Global Interpreter Lock), which prevents it from using more than one core when multithreading. For geoprocessing this is bad news, because it would cram a heap of CPU-intensive geoprocessing tasks onto a single core, since they are all running through the same interpreter. What I do is multithread (or preferably use concurrent.futures, because it is easier) a bit of code that in turn spawns a subprocess. The subprocess is not constrained by the GIL (only the threads are). This sounds complicated but really isn't, and it lets you use the nice process-queuing/worker relationship provided by multithreading while giving you full access to the total processing power of your PC.
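A sketch of that thread-plus-subprocess pattern, assuming a Python 3 interpreter (or the futures backport on 2.x) is available to run the dispatcher; only the children need Arc's own Python, whose path below is an assumption, as is the process_point.py worker from the earlier sketch:

# dispatcher sketch: threads only queue work and launch python.exe,
# so the GIL never throttles the geoprocessing running in the children
import concurrent.futures
import subprocess

ARC_PYTHON = r"C:\Python25\python.exe"  # assumed path to Arc's Python 2.5

def run_point(pt):
    # subprocess.call blocks this thread until its child exits,
    # so at most max_workers children ever run at once
    return subprocess.call([ARC_PYTHON, "process_point.py", str(pt)])

# max_workers=3 follows the cores-minus-one rule of thumb on a quad core
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    exit_codes = list(pool.map(run_point, range(54567)))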
To maintain control, you can use subprocess.call instead of subprocess.Popen to force your thread to wait for the subprocess to finish. You may want to do this if a subsequent task relies on the output of a given subprocess. You will also want to limit the number of subprocesses, because a subprocess is independent of the thread unless you have a callback, so a single thread could spawn hundreds of simultaneous subprocesses if you are not careful. BUT more parallelization does not necessarily equate to more speed on a given machine; you can actually end up slowing things down! If this is a task that will be repeated a lot, a little profiling can go a long way. Geoprocessing tasks tend to be quite CPU- and memory-intensive, and I have settled on a rule of thumb of limiting my geoprocessing subprocesses to one less than the number of cores my PC has (assuming I have a quad core or better - and how I miss the 16-core beast I had in one company). This leaves some room for the operating system and the Python interpreter itself.
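The worker count was hard-coded in the sketch above; deriving it from the machine keeps that rule of thumb portable. A one-liner, assuming the standard-library multiprocessing module (Python 2.6+) is available:

import multiprocessing

# rule of thumb: one less than the core count, leaving a core free
# for the operating system and the dispatching interpreter itself
max_workers = max(1, multiprocessing.cpu_count() - 1)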