I am running ArcMap 10.2 via a Python 2.7.5 interface (the Python installation that came with ArcMap). My script takes in addresses, first geocodes them, and then does a spatial join against a .shp file to identify the census tract associated with each address.
The script tends to work fine for a while (maybe an hour or two). At top speed, it processes around 2 million addresses per hour (I have about 100 million total to process). Periodically, however, the script simply stalls. CPU usage from ArcPy drops to essentially zero. No errors are returned, and the process is still technically running, but nothing is actually happening.
I reviewed this somewhat similar question: ArcPy script stops running randomly with no error returned. In my case, however, low RAM does not seem to be an issue (my addresses are broken up into many smaller files, usually 500k addresses at a time at most), and I checked that I am running a 64-bit version of Python. Also, my problem is somewhat different from the one described in that post: in my situation, when ArcPy stalls, CPU usage goes to zero rather than staying high.
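For anyone who wants to run the same bitness check, something like this works (struct and sys are both in the standard library):

import struct, sys
print sys.version               ## full build string, e.g. "... [MSC v.1500 64 bit (AMD64)]"
print struct.calcsize("P") * 8  ## pointer size in bits: prints 64 on a 64-bit build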
UPDATE:
I am mainly a Linux and Mac user, so I hadn't noticed the way that Windows reports errors from Python script execution. I have now seen that after stalling for a while as described above, I get a dialog from Windows reporting problems with the script, and once I click on that to kill the process I get the following:
Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'GpMetadataFunctions, Version=10.2.0.0, Culture=neutral, PublicKeyToken=8fc3cc631e44ad86' or one of its dependencies. The system cannot find the file specified.
File name: 'GpMetadataFunctions, Version=10.2.0.0, Culture=neutral, PublicKeyToken=8fc3cc631e44ad86'
at System.Reflection.Assembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, Assembly locationHint, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection)
at System.Reflection.Assembly.InternalLoad(AssemblyName assemblyRef, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection)
at System.Reflection.Assembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection)
at System.Reflection.Assembly.Load(String assemblyString)
at GetManagedType(Char* _assembly, Char* _path, Char* _className)
I saw some discussion here of a suggestion to add DLLs to the Global Assembly Cache (GAC). I don't believe that I currently have such a cache, and I'm not quite certain:
(a) why it would be necessary, and
(b) why issues with it would cause my script to sometimes work but sometimes not, which seems to differ from the situations discussed in the ESRI thread.
Should I go ahead anyway and try to add the given files to the GAC? Being new to Windows, I'm a bit reluctant to muck around with settings like this that I don't really understand, particularly given that it seems unclear whether this is really at the heart of the issue.
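In case it's useful for diagnosis, a quick way to look for the assembly on disk would be something like the following (the two paths are the usual GAC locations on Windows; the search is just a case-insensitive filename match):

## Search the standard .NET GAC folders for the assembly named in the traceback.
import os

gac_roots = ["C:\\Windows\\assembly", "C:\\Windows\\Microsoft.NET\\assembly"]
for root in gac_roots:
    for dirpath, dirnames, files in os.walk(root):
        for f in files:
            if f.lower().startswith("gpmetadatafunctions"):
                print os.path.join(dirpath, f)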
SYSTEM DETAILS
Below I have posted (1) more details on the system I am using and (2) the script I am using.
The details of the Python installation are:
Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
This is on a Windows virtual machine (via VirtualBox) run on a host that is a Mac Pro 6,1 (late 2013) with 64 GB RAM and an E5 processor at 3.5 GHz x 6 cores (each double-threaded). I am allocating 10 virtual processors and 8 GB of RAM to the VM, and ArcPy is the only thing running on it apart from the operating system. Task Manager shows well under 50% of the VM's total RAM in use at any one time. I could easily allocate more RAM to the VM, but RAM doesn't seem to be the constraint.
SCRIPT
# C:\Python27\ArcGISx6410.2\python.exe C:\Dropbox\GIS\geocoding.py
# Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
print 'Importing Modules'
import arcpy
import os
from os import walk
import csv
from datetime import datetime, timedelta
arcpy.env.overwriteOutput = True
######################################
## Define some helper functions ##
## for formatting progress reports ##
######################################
CF = lambda x: "{:,}".format(x) ## Format numbers into strings with commas
format_time = lambda x: str(x).split('.')[0]
def chop_microseconds(delta): return (delta - timedelta(microseconds=delta.microseconds))
##############################
## DEFINE LOCAL VARIABLES ##
##############################
input_dir = "C:\\Dropbox\\GIS\\Data\\ToBeMatched\\County_1990\\"
output_dir = "C:\\Dropbox\\GIS\\Data\\Matched\\County_1990\\"
db = "C:\\GIS\\GeoDB\\NewGeoDB.gdb\\"
print "db = %s" %db
address_locator = "C:\\GIS\\Data_and_Maps_for_ArcGIS2013_NO2_135752\\streetmap_na\\data\\Street_Addresses_US.loc"
address_fields = "'Single Line Input' Situs_Address"
print "address_fields = %s" %address_fields
jf = "C:\\GIS\\NHGIS_Tract_Shape_Files\\us_tract_1990\\us_tract_1990.shp"
Completed_Files = set([F.split('_')[0] + ".csv" for F in os.listdir(output_dir) if '.csv' in F])
print 'Starting Work'
## get list of filenames (csvs) in input directory ##
def getFileNames():
    filenames = []
    for (dirpath, dirnames, files) in walk(input_dir):
        filenames.extend(files)
    return filenames
filenames = getFileNames() # list of csv file names
filenames = [F for F in filenames if '.csv' in F]
print 'Filenames = '
print filenames[:10]  ## preview the first ten file names
for fdx, filename in enumerate(filenames):
    if filename in Completed_Files: continue  ## skip finished files so the script can resume
    print '\n(%d of %d) Reading from %s, current time: %s' %(fdx + 1, len(filenames), filename, format_time(datetime.now()))
    ##########################################
    ## Specify FileNames for Intermediate ##
    ## and Final Results ##
    ##########################################
    T0 = datetime.now()
    address_table = input_dir + filename ## input file name
    geocode_result = os.path.join(db, filename[:-4] + "_geocodeop")
    ofc = os.path.join(db, filename[:-4] + "_joinop")
    outputcsv = output_dir + filename[:-4] + "_finalop.csv" # name of final csv output
    #### GEOCODING PROCESS #####
    arcpy.GeocodeAddresses_geocoding(address_table, address_locator, address_fields, geocode_result, 'STATIC')
    GeoCode_Time = datetime.now() - T0
    GeoCode_Time = chop_microseconds(GeoCode_Time)
    print "Finished Geocoding! current time: %s, Time to GeoCode = %s" %(format_time(datetime.now()), GeoCode_Time)
    # Process: Spatial Join With Census Tract Shape File
    arcpy.SpatialJoin_analysis(
        target_features = geocode_result,
        join_features = jf,
        out_feature_class = ofc,
        join_operation = "JOIN_ONE_TO_ONE",
        match_option = "WITHIN",
        join_type = "KEEP_COMMON" # Change to KEEP_ALL if we want to keep data that has no match
    )
    # Process: Export to CSV
    fields = arcpy.ListFields(ofc)
    field_names = [field.name for field in fields]
    Lines_Written = 0
    print "Writing to %s, current time = %s" %(outputcsv, format_time(datetime.now()))
    with open(outputcsv, 'wb') as f:
        w = csv.writer(f)
        w.writerow(field_names[-11:])
        cursor = arcpy.SearchCursor(ofc)
        for row in cursor:
            field_vals = [row.getValue(field.name) for field in fields[-11:]]
            w.writerow(field_vals)
            Lines_Written += 1
        del cursor  ## release the cursor (and its lock on the geodatabase)
    T1 = datetime.now()
    Hours = (T1 - T0).total_seconds() / 3600
    Lines_per_Hour = int(round(Lines_Written / Hours))
    Lines_per_Hour = CF(Lines_per_Hour)
    print "Total Time = %s, Lines Written = %s, Addresses Per Hour: %s" %(chop_microseconds(T1 - T0), CF(Lines_Written), Lines_per_Hour)
Well, this isn't a full or completely satisfying answer, so I'd be delighted to accept a better one if someone has it. It seems, though, that if ArcPy encounters an error for any reason while executing a script, and the script stalls, terminates, or needs to be externally terminated as a result, some kind of corruption occurs in the geodatabase that was being used.
Sometimes it is enough to simply start again with a file whose name differs from the one you were initially trying to read (even if this just means renaming your file). In general, however, a more reliable method seems to be to create a whole new geodatabase and delete the old one. I've had pretty good luck with this procedure resolving the issue.
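For what it's worth, here is a minimal sketch of that recovery pattern, assuming the per-file work is wrapped in a helper (process_file below is hypothetical; Exists, Delete_management, and CreateFileGDB_management are standard arcpy tools):

## Sketch only: on failure, abandon the old geodatabase and retry against a fresh one.
import os
import arcpy

GDB_FOLDER = "C:\\GIS\\GeoDB"

def fresh_gdb(name):
    ## Delete the (possibly corrupted) geodatabase and create a clean one.
    path = os.path.join(GDB_FOLDER, name + ".gdb")
    if arcpy.Exists(path):
        arcpy.Delete_management(path)
    arcpy.CreateFileGDB_management(GDB_FOLDER, name + ".gdb")
    return path

try:
    process_file(filename, db)        ## geocode + join + export, as in the script above
except arcpy.ExecuteError:
    db = fresh_gdb("NewGeoDB_retry")  ## new database (and new output names) for the retry
    process_file(filename, db)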