Monday 1 June 2015

arcpy - Scaling DA UpdateCursor to large datasets?


I am currently having what appears to be a scaling issue with arcpy, da update cursors, and large file geodatabases.


I have a piece of code that iterates through every feature in a feature class and does some calculations and manipulation on the data. It works great for smaller datasets, but it is orders of magnitude slower on large ones. I have a simple counter in it and a print statement every 1,000 records for benchmarking; here are the results:



  • 48k features: each 1,000 takes ~0.12 seconds

  • 133k features: each 1,000 takes ~0.12 seconds

  • 2mil features: each 1,000 takes ~0.17 seconds

  • 48mil features: each 1,000 takes ~23 seconds (yes, twenty-three seconds, no decimal place)


I only let the 48mil run for a few minutes before killing it, but the rest ran to completion, and these times are very consistent from the first batch to the last, with only a few hundredths of a second of deviation once in a while. Even if the per-batch time were scaling linearly, I would only expect about 0.05 seconds more per 1,000 for every additional 2mil records, which would put 48mil somewhere around 0.12 + 24 × 0.05 ≈ 1.3 seconds per 1,000. The actual result is about 20 times that.



The smaller feature classes are just subsets of the full data made for testing, and all of them were created by the same method, so I do not think the problem comes from differences in the data other than size. The slowdown seems to come purely from the number of features.


Sorry I don't have the exact code with me at the moment (it is at work), but this is basically it. The code itself works fine; I am mostly wondering if anyone has run into similar issues with large datasets. I was thinking it could be a memory leak, but these are all run outside of Arc in the Python window, and I would expect a memory leak to slow things down gradually over time rather than being much slower right from the start.



import arcpy
from datetime import datetime

...

i = 0
fromlast = datetime.now()

with arcpy.da.UpdateCursor(fc, fields) as rows:
    for row in rows:
        if i % 1000 == 0:
            now = datetime.now()
            print i, ': ', now - fromlast
            fromlast = now
        ###do stuff here
        rows.updateRow(row)
        i += 1


I only have ArcGIS (ArcInfo license), gdal/ogr2ogr, and python to work with, but I am not set on using the FGDB or da cursors if there is some better way to do it within my limited selection of tools.



Answer



Does creating an index seem to have any impact on run time?
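
For example, a non-unique attribute index can be added with the Add Attribute Index tool before opening the cursor. This is just a sketch; the path and field name are placeholders for your own feature class and whichever field(s) your SQL filter or calculations actually touch:

import arcpy

# Placeholder path and field -- substitute your own feature class and the
# field(s) used in your where clause or calculations.
fc = r"C:\data\big.gdb\features"
arcpy.AddIndex_management(fc, ["SOME_FIELD"], "idx_some_field", "NON_UNIQUE", "ASCENDING")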


Also, you can use a SQL statement to restrict the values you are accessing. If you really do need to access a very large amount of data, you can nest the SQL statement and cursor within a while loop.


Edit: I've run some benchmarks on my machine (Windows 7 x64, 2.4GHz i3-370m (lol), 8GB RAM, ArcMap 10.1 SP1). I created a feature class of 25,000,000** rows with a field, x, containing sequential integers. I'm assuming that update cursors are slower, so that is why I tested a search cursor.


import arcpy, time
shp = "C:/images/junk/massive.gdb/large"
entries = int(arcpy.GetCount_management(shp).getOutput(0))
test1 = []
startval = 0

breakval = 1000000

c = time.clock()
while startval < entries:
    sql = '"OBJECTID" BETWEEN {0} AND {1}'.format(startval + 1, startval + breakval)
    test1.extend([row[0] for row in arcpy.da.SearchCursor(shp, "x", sql)])
    startval += breakval
print time.clock() - c

c = time.clock()

test2 = [row[0] for row in arcpy.da.SearchCursor(shp, "x")]
print time.clock() - c

print test1 == test2

The results were as follows:


614.128610407
601.801415697
True


This would place my read time at ~41,000 records/s, i.e. roughly 24.4 µs per record, or about 24 ms per 1,000 records.


**I ran test1 with 50,000,000 features and it took 1217 seconds. I couldn't run test2 because I received a memory overflow error. Regardless, I doubled the features and the time roughly doubled, which is encouraging.
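
The memory error comes from materializing all 50 million values in a single Python list, so if you only need an aggregate you can iterate the cursor directly instead. A minimal sketch reusing the path and field from the benchmark above (the sum is just an illustrative stand-in for whatever you actually compute):

import arcpy

shp = "C:/images/junk/massive.gdb/large"

# Stream the rows rather than building one giant list; only running totals
# stay in memory.
total = 0
count = 0
for row in arcpy.da.SearchCursor(shp, "x"):
    total += row[0]
    count += 1

print count, total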


I'm not sure where the cutoff point is for you at which the access time skyrockets, but if this is something you will be running often, it's worth taking the time to optimize breakval. To do that, increase or decrease breakval and time just a subset of your FC via the SQL clause (the entire FC clearly takes too long). If the subset's run time multiplied by entries/breakval (i.e. the projected total run time) decreases between runs, keep adjusting breakval in that direction.
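
For the original question, the same batching pattern should carry over to the update loop more or less directly. A rough sketch, assuming OBJECTIDs run sequentially from 1 and using placeholder names for the feature class, fields, and per-row work from the question:

import arcpy

# Placeholders -- substitute the feature class, field list and per-row work
# from the original script.
fc = r"C:\data\big.gdb\features"
fields = ["FIELD_A", "FIELD_B"]

entries = int(arcpy.GetCount_management(fc).getOutput(0))
breakval = 1000000  # tune this as described above
startval = 0

while startval < entries:
    # Each cursor only spans one OBJECTID range instead of the whole table.
    sql = '"OBJECTID" BETWEEN {0} AND {1}'.format(startval + 1, startval + breakval)
    with arcpy.da.UpdateCursor(fc, fields, sql) as rows:
        for row in rows:
            ###do stuff here
            rows.updateRow(row)
    startval += breakval

Note that this assumes OBJECTIDs run from 1 to the row count with no gaps; if rows have been deleted, loop up to the maximum OBJECTID instead of the row count so no range is skipped.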

