My question is: what might cause the ArcGIS 9.3 Hydrology toolbox's Snap Pour Point and Watershed tools to run significantly slower in a Python script than when the tools are run manually in ArcCatalog (both runs using exactly the same input data files and parameters)?
I'm running this script from the Windows command line and from an IDE (Komodo), so neither ArcMap nor ArcCatalog is open during those runs. I'm using ArcGIS 9.3.1 on a Windows XP machine with 4 GB of RAM and plenty of processor power (a Dell Precision machine). All input data are local and in the same projection.
I have pre-processed Flow Accumulation, Flow Direction, and a Feature Class of water-quality sample sites to be used as pour points, and all datasets are in NAD83 UTM 17N projection. The Flow Accum and Flow Dir data files have spatial extents based on Hydrologic Unit Code (HUC) eight digit boundaries. The script iterates through each record in the sample site feature class, determines which eight digit HUC the point falls in, selects and exports the site record as a new feature class in a scratch geodatabase, and then handles the watershed processing using that exported point feature class.
I've stripped out comments and some calls to a "processing log timestamp" function that reports tool run times, just to focus on the question at hand, and have posted the relevant section of the script below. Based on the processing log, the biggest time sinks are the Snap Pour Point tool and the Watershed tool. We're talking on the order of 4+ minutes per site for the Snap Pour Point tool (using a subset of all the sites for testing), and then another 3-4 minutes per site for the Watershed tool.
However, I can take the same single-point feature class and the same flow accumulation and flow direction rasters that the Python script is using, and run the Snap Pour Point and Watershed tools manually in ArcCatalog in approximately 1 minute or less for each site (the resulting watersheds are the same as the ones the script produces). Example: for one of the sites, I ran the Snap Pour Point tool manually in ArcCatalog in 58 seconds, while the Python script took 3 minutes 59 seconds to complete the same tool. I then ran the Watershed tool manually for the same site and it took 1 minute 1 second in ArcCatalog, while the Python script took 4 minutes 8 seconds. I'm using a 50 meter buffer for the Snap Pour Point tool both in the Python script and in my manual ArcCatalog comparisons.
As you can see, CONSIDERABLY slower in the command line Python script. I've tried running this in an IDE (Komodo) and then via the Windows command line...same results with both. I've hit a wall with what I can figure out, so much appreciate any suggestions or insight.
try:
    if hucGdb in gdbFiles:
        # Build paths to the scratch output and this HUC's flow rasters.
        outScratch = scratch + '\\' + delinSiteID + 'temp'
        flowAccum = filePath + '\\' + hucGdb + '\\' + 'FA_' + huc
        flowDir = filePath + '\\' + hucGdb + '\\' + 'FD_' + huc

        # Export the current site as its own feature class.
        tempEnvironment = gp.extent
        gp.extent = flowAccum
        clause = '[DELIN_ID] = ' + "'" + delinSiteID + "'"
        gp.Select_analysis(delinSites, outScratch, clause)
        gp.extent = tempEnvironment

        # Snap the site to the flow accumulation raster.
        outSnapRaster = snapPoint + '\\' + 'p' + delinSiteID
        gp.SnapPourPoint_sa(outScratch, flowAccum, outSnapRaster, tolerance)
        gp.extent = tempEnvironment

        # Convert the snapped raster cell to a point and tag it with the site IDs.
        outSnapFeature = snapPoint + '\\' + 'p' + delinSiteID + '_snap'
        gp.RasterToPoint_conversion(outSnapRaster, outSnapFeature, "VALUE")
        gp.AddField_management(outSnapFeature, "ORIGINAL_SITE_ID", "TEXT", "#", "#", "20")
        gp.CalculateField_management(outSnapFeature, "ORIGINAL_SITE_ID", '"' + realID + '"')
        gp.AddField_management(outSnapFeature, "DELIN_ID", "TEXT", "#", "#", "20")
        gp.CalculateField_management(outSnapFeature, "DELIN_ID", '"' + delinSiteID + '"')

        # Delineate the watershed from the snapped pour point.
        outWsRaster = scratch + '\\' + delinSiteID
        tempEnvironment = gp.extent
        gp.extent = flowDir
        gp.Watershed_sa(flowDir, outSnapRaster, outWsRaster)
        gp.extent = tempEnvironment

        # Convert the watershed raster to a polygon and tag it with the site IDs.
        outWsFeature = basinOutput + '\\' + delinSiteID
        gp.RasterToPolygon_conversion(outWsRaster, outWsFeature, "SIMPLIFY")
        gp.AddField_management(outWsFeature, "DELIN_ID", "TEXT", "#", "#", "25")
        gp.CalculateField_management(outWsFeature, "DELIN_ID", '"' + delinSiteID + '"')
        gp.AddField_management(outWsFeature, "ORIGINAL_SITE_ID", "TEXT", "#", "#", "25")
        gp.CalculateField_management(outWsFeature, "ORIGINAL_SITE_ID", '"' + realID + '"')
    else:
        msg = "Site ID " + realID + " does not have a geodatabase of data, moving onto the next site"
        print msg
        logFile.write(msg + '\n')
except Exception, ErrorDesc:
    # Bind the exception so ErrorDesc is defined when it is reported below.
    msg = "ARCGISSCRIPTING ERROR: " + gp.GetMessages(2)
    print msg
    logFile.write(msg + '\n')
    msg = "ERROR: " + str(ErrorDesc)
    print msg
    logFile.write(msg + '\n')
EDIT
Just to see what happens, I set up a new toolbox in ArcToolbox and added my Python script to it as a script tool. The script takes its arguments from a "configuration" text file that it reads in, so in this setup I just double-click the script tool and click "OK", as there are no variables to set. I ran the script via ArcToolbox in ArcCatalog, not ArcMap.
The script ran in 7 minutes 32 seconds for my same three test sites (same file formats, everything), versus 20+ minutes when run as a standalone script, so about one-third of the standalone run time.
Running the script in ArcCatalog definitely shaved off a bunch of time on the Snap Pour Point and Watershed tool runs (everything else was already reasonably fast so not worried about other tools/functions) based on my time stamp (tool started...tool finished) reporting I have built in.
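The "tool started...tool finished" reporting mentioned above is not shown in the post. A minimal sketch of that kind of timing wrapper might look like the following; the name `timed_call` and its arguments are hypothetical and are not the author's logging function.

```python
import time


def timed_call(label, func, args=(), log=None):
    """Run func(*args), report the elapsed time, and return (result, seconds).

    Hypothetical helper: `label` is free text for the log line and `log` is
    any object with a write() method (for example an open log file).
    """
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    minutes, seconds = divmod(int(round(elapsed)), 60)
    message = "%s finished in %d min %d s" % (label, minutes, seconds)
    print(message)
    if log is not None:
        log.write(message + "\n")
    return result, elapsed


# Possible use with the script's own variables (requires ArcGIS, so left as a comment):
# timed_call("Snap Pour Point", gp.SnapPourPoint_sa,
#            (outScratch, flowAccum, outSnapRaster, tolerance), logFile)
```

Wrapping each geoprocessing call this way keeps the per-tool timings comparable between the standalone run and the ArcCatalog run.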
So it seems that something is going on (or perhaps, in this case, not going on) when you run a Python script outside of ArcGIS (via the command line or an IDE) that incurs considerable overhead on raster-calculation tools like Watershed and Snap Pour Point.
I don't have any special settings (at least none that I recognize or am aware of) in the ArcToolbox environments. As far as I know, these are the defaults (I've never changed them; see attached screenshot). How would these settings, or the lack of them, affect such tools? Is that the issue? If so, can they be set in standalone Python code?
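One way to compare the two situations would be to print what the geoprocessor reports for a few environment settings, once in the standalone session and once from the script tool in ArcCatalog. The sketch below is only an illustration: `snapshot_environment` is a hypothetical helper, and the property names in the default list are assumed to match the 9.3 geoprocessor (the script itself only uses `gp.extent`).

```python
def snapshot_environment(gp, names=("Workspace", "ScratchWorkspace", "Extent")):
    """Return a dict of selected geoprocessor settings (None if unreadable).

    Hypothetical helper; the default property names are assumptions about
    the 9.3 geoprocessor object and may need adjusting.
    """
    settings = {}
    for name in names:
        try:
            settings[name] = getattr(gp, name)
        except Exception:
            # A setting that cannot be read is recorded as None.
            settings[name] = None
    return settings


def print_environment(gp, names=("Workspace", "ScratchWorkspace", "Extent")):
    """Print one 'name = value' line per setting, sorted by name."""
    snapshot = snapshot_environment(gp, names)
    for name in sorted(snapshot):
        print("%s = %r" % (name, snapshot[name]))
```

Any setting that shows a value in ArcCatalog but comes back empty in the standalone run is a candidate for being set explicitly in the script.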
I think this is getting closer.
Answer
Through discussions on another forum, I was able to figure out why the standalone script was running so slowly: I did not have the gp.ScratchWorkspace property set. Once I set it to the same directory as my ArcToolbox General environment setting, I was able to run the Python script from the command line in approximately 6 minutes 49 seconds, about 40 seconds quicker than the run using the script tool set up in ArcCatalog (with all the same input/output files and formats as before).
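For readers who want to see the fix in code, here is a minimal sketch. The helper name `set_scratch_workspace` and the folder in the usage comment are hypothetical; the only part taken from this post is assigning a directory to `gp.ScratchWorkspace` before the raster tools run.

```python
import os


def set_scratch_workspace(gp, scratch_dir):
    """Create scratch_dir if needed and assign it to gp.ScratchWorkspace."""
    if not os.path.isdir(scratch_dir):
        os.makedirs(scratch_dir)
    gp.ScratchWorkspace = scratch_dir
    return gp.ScratchWorkspace


# Hypothetical standalone usage (requires ArcGIS 9.3; the folder is made up):
# import arcgisscripting
# gp = arcgisscripting.create(9.3)
# gp.CheckOutExtension("Spatial")
# set_scratch_workspace(gp, r"C:\temp\gp_scratch")
```

Calling this once, right after the geoprocessor is created, should be enough, since the setting stays on the `gp` object for the rest of the run.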
Someone had also suggested the gp.LoadSettings method to pull in the environment settings from an XML file. I tried this by saving the environment from ArcCatalog to XML and then pointing gp.LoadSettings at that file, but it kept returning an error saying the object could not be found. I think this is a known bug, as I found a separate forum discussion that alludes to the problem in the context of ArcObjects (the same thing seemed to be happening with Python): http://forums.arcgis.com/threads/7204-igeoprocessor.savesettings-and-igeoprocessor.loadsettings-not-persisting-settings
So gp.ScratchWorkspace was exactly the type of setting I was hoping would solve the problem: a 2-second fix that gives you a "d'oh!" moment but lets you quickly forge ahead. On the surface it doesn't seem like a big deal, but leaving it unset roughly tripled the total run time of the script (20+ minutes versus 6 minutes 49 seconds) when executed from the command line outside of ArcGIS.