I am currently writing a Python script using the arcgisscripting module to process a reasonably large data set (~10,000 records) normalised over 8 tables. The process consists of creating features from coordinate tuples (x,y) and building a graph (nodes and lines) using the relationships in the other 7 tables for guidance. The final output is a personal or file geodatabase (pgdb/fgdb) with node and edge spatial data sets that visually represent the relationships.
My initial attempt was to query the newly created geodatabase tables with SearchCursor record sets and populate the link tables for the many-to-many relationships with an InsertCursor. This worked very well, except for the 15-20 minute processing time.
Using the cProfile module in Python, it was apparent that 'thrashing' the personal geodatabase with requests for cursors (Search and Insert) when performing the search queries to populate the link tables caused the appalling performance.
With a little refactoring I have managed to get the processing time below 2.5 minutes. The trade-off was partial construction of the geodatabase schema in code and limiting the requests for arcgisscripting cursors to InsertCursors once all relationships were collated.
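For readers who want the shape of that refactoring, here is a minimal sketch of the approach: cache the source tables into plain Python dictionaries with SearchCursors, resolve the relationships in memory, and only then open a single InsertCursor. All paths, table names and fields below are hypothetical.

```python
# Hypothetical sketch: collate relationships in memory, then do one InsertCursor pass.
import arcgisscripting

gp = arcgisscripting.create(9.3)

# 1. Read each source table once and cache the rows in plain Python dicts.
node_coords = {}                                          # node_id -> (x, y)
rows = gp.SearchCursor(r"C:\data\source.mdb\Nodes")       # hypothetical path
row = rows.Next()
while row:
    node_coords[row.GetValue("NodeID")] = (row.GetValue("X"), row.GetValue("Y"))
    row = rows.Next()
del rows

links = []                                                # (from_id, to_id) pairs
rows = gp.SearchCursor(r"C:\data\source.mdb\Relationships")
row = rows.Next()
while row:
    links.append((row.GetValue("FromID"), row.GetValue("ToID")))
    row = rows.Next()
del rows

# 2. Resolve the many-to-many relationships purely in memory (cheap),
#    then open a single InsertCursor and write everything in one pass.
cur = gp.InsertCursor(r"C:\data\output.mdb\Links")
for from_id, to_id in links:
    new_row = cur.NewRow()
    new_row.SetValue("FromID", from_id)
    new_row.SetValue("ToID", to_id)
    cur.InsertRow(new_row)
del cur    # releasing the cursor flushes writes and unlocks the table
```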
My question is one of performance:
- What techniques have people used to maintain reasonable compute times when working with large data sets?
- Are there any ESRI recommended methods that I've missed in my search for optimisation?
I understand the overhead incurred when creating an arcgisscripting cursor, particularly if it is from a personal geodatabase, though after a lengthy search for performance related answers from this site and Google I am under the impression that performance isn't at the forefront of people's endeavours.
- As a user of ESRI products, does one expect and condone these performance lags?
UPDATE
After some work with this product I have accumulated a list of optimization techniques used in a process that converts spatial information from a proprietary format to a geodatabase. These were developed for both personal and file geodatabases. Tidbits:
- Read your data and rationalize it in memory. This will cut your time in half.
- Create feature classes and tables in memory. Use the 'in_memory' workspace keyword to treat your memory as a RAM disk, perform your functions there, and then write out to disk.
- To write out to disk, use the CopyFeatures geoprocessing tool for feature classes and CopyRows for tables (a minimal sketch follows below).
These 3 things took a script that converted 100,000+ features to a geodatabase from 30 minutes down to 30-40 seconds, including relationship classes. They are not to be used lightly: most of the methods above use a lot of memory, which could cause you issues if you are not paying attention.
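A rough sketch of that in_memory workflow, assuming the standard CreateFeatureclass, CopyFeatures and CopyRows geoprocessing tools; paths, names and fields are illustrative, not from the original script:

```python
# Hypothetical sketch of the in_memory workflow described above.
import arcgisscripting

gp = arcgisscripting.create(9.3)

node_coords = {1: (0.0, 0.0), 2: (10.0, 5.0)}    # stand-in for data rationalized in memory

# Build the feature class in the in_memory workspace instead of on disk.
gp.CreateFeatureclass_management("in_memory", "nodes", "POINT")
gp.AddField_management("in_memory/nodes", "NodeID", "LONG")

cur = gp.InsertCursor("in_memory/nodes")
for node_id, (x, y) in node_coords.items():
    row = cur.NewRow()
    pnt = gp.CreateObject("Point")
    pnt.x, pnt.y = x, y
    row.shape = pnt
    row.SetValue("NodeID", node_id)
    cur.InsertRow(row)
del cur

# Write the finished in-memory datasets out to the geodatabase in one shot
# (assuming a table 'in_memory/links' was built the same way).
gp.CopyFeatures_management("in_memory/nodes", r"C:\data\output.gdb\nodes")
gp.CopyRows_management("in_memory/links", r"C:\data\output.gdb\links")
```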
Answer
Although this question was already answered, I thought I could chime in and give my two cents.
DISCLAIMER: I worked for ESRI on the GeoDatabase team for some years and was in charge of maintaining various parts of GeoDatabase code (Versioning, Cursors, EditSessions, History, Relationship Classes, etc.).
I think the biggest source of performance problems with ESRI code is not understanding the implications of using different objects, particularly, the "little" details of the various GeoDatabase abstractions! So very often, the conversation switches to the language being used as a culprit of the performance issues. In some cases it can be. But not all the time. Let's start with the language discussion and work our way back.
1. The programming language that you pick only matters when you are doing something that is complicated, in a tight loop. Most of the time, this is not the case.
The big elephant in the room is that at the core of all ESRI code, you have ArcObjects - and ArcObjects is written in C++ using COM. There is a cost for communicating with this code. This is true for C#, VB.NET, python, or whatever else you are using.
You pay a price at initialization of that code. That may be a negligible cost if you do it only once.
You then pay a price for every subsequent time that you interact with ArcObjects.
Personally, I tend to write code for my clients in C#, because it is easy and fast enough. However, every time I want to move data around or do some processing for large amounts of data that is already implemented in Geoprocessing I just initialize the scripting subsystem and pass in my parameters. Why?
- It is already implemented. So why reinvent the wheel?
- It may actually be faster. "Faster than writing it in C#?" Yes! If I implement, say, data loading manually, it means that I pay the price of .NET context switching in a tight loop. Every GetValue, Insert, ShapeCopy has a cost. If I make one call in GP, that entire data loading process happens in the actual implementation of GP - in C++ within the COM environment. I don't pay the price for context switching because there is none - and hence it is faster. A small sketch of the contrast follows below.
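As a hedged illustration of that contrast (the datasets are made up), compare a hand-rolled copy loop in a scripting language, which crosses the COM boundary on every call, with handing the whole job to one geoprocessing call:

```python
import arcgisscripting

gp = arcgisscripting.create(9.3)

src = r"C:\data\input.gdb\parcels"       # hypothetical datasets
dst = r"C:\data\output.gdb\parcels"

# Option A: manual copy - every GetValue/InsertRow crosses the COM boundary,
# so a tight loop over a large table pays the context-switch cost per field, per row.
# (Assumes dst already exists with a matching schema.)
in_rows = gp.SearchCursor(src)
out_rows = gp.InsertCursor(dst)
row = in_rows.Next()
while row:
    new_row = out_rows.NewRow()
    new_row.shape = row.shape
    new_row.SetValue("ParcelID", row.GetValue("ParcelID"))
    out_rows.InsertRow(new_row)
    row = in_rows.Next()
del in_rows, out_rows

# Option B: one geoprocessing call - the whole copy runs inside the C++/COM
# implementation, so there is no per-row boundary crossing from the script.
gp.CopyFeatures_management(src, dst)
```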
Ah yes, so then the solution is to use a lot of geoprocessing functions. Actually, you have to be careful.
2. GP is a black box that copies data (potentially unnecessarily) around
It is a double-edged sword. It is a black box that does some magic internally and spits out results - but those results are very often duplicated. 100,000 rows can easily be turned into 1,000,000 rows on disk after you have run your data through 9 different functions. Using only GP functions is like creating a linear GP model, and well...
3. Chaining too many GP functions for large datasets is highly inefficient. A GP Model is (potentially) equivalent to executing a query in a really really really dumb way
Now don't get me wrong. I love GP Models - they save me from writing code all the time. But I am also aware that they are not the most efficient way of processing large datasets.
Have you ever heard of a Query Planner? Its job is to look at the SQL statement you want to execute, generate an execution plan in the form of a directed graph that looks a heck of a lot like a GP Model, look at the statistics stored in the db, and choose the most optimal order in which to execute it. GP just executes things in the order you put them because it doesn't have statistics to do anything more intelligently - you are the query planner. And guess what? The order in which you execute things is very dependent on your dataset. The order in which you execute things can make the difference between days and seconds, and that is up to you to decide.
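To make the ordering point concrete, here is a hypothetical two-step chain where the order alone decides how much work is done, assuming the standard Select and Buffer tools; the datasets and counts are invented for illustration:

```python
import arcgisscripting

gp = arcgisscripting.create(9.3)

parcels = r"C:\data\input.gdb\parcels"     # hypothetical: 1,000,000 features
where = "\"COUNTY\" = 'Denver'"            # filter keeps roughly 10,000 of them

# Slow order: buffer everything, then filter - the expensive operation
# runs against the full million features.
gp.Buffer_analysis(parcels, "in_memory/all_buffered", "100 Meters")
gp.Select_analysis("in_memory/all_buffered", r"C:\data\output.gdb\result_slow", where)

# Fast order: filter first, then buffer only the survivors - same result,
# but the expensive operation now touches about 1% of the data.
gp.Select_analysis(parcels, "in_memory/denver_only", where)
gp.Buffer_analysis("in_memory/denver_only", r"C:\data\output.gdb\result_fast", "100 Meters")
```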
"Great" you say, I will not script things myself and be careful about how I write stuff. But do you understand GeoDatabase abstractions?
4. Not understanding GeoDatabase abstractions can easily bite you
Instead of pointing out every single thing that can possibly give you a problem, let me just point out a few common mistakes that I see all the time and some recommendations.
- Understanding the difference between True/False for Recycling cursors. This tiny little flag set to true can make runtime orders of magnitude faster.
- Put your table in LoadOnlyMode for data loads. Why update the index on every insert?
- Understand that even though IWorkspaceEdit::StartEditing looks the same in all workspaces, they are very different beasts on every datasource. On an Enterprise GDB, you may have versioning or support for transactions. On shapefiles, it will have to be implemented in a very different way. How would you implement Undo/Redo? Do you even need to enable it (yes, it can make a difference in memory usage).
- The difference between batch operations and single-row operations. Case in point: GetRow vs GetRows - this is the difference between running a query to get one row and running one query to fetch multiple rows. A tight loop with a call to GetRow means horrible performance, and it is culprit #1 of performance issues (a Python-flavoured sketch of the same idea follows after this list).
- Use UpdateSearchedRows
- Understand the difference between CreateRow and CreateRowBuffer. Huge difference in insert runtime.
- Understand that IRow::Store and IFeature::Store trigger super heavy polymorphic operations. This is probably culprit #2 of really slow performance. It doesn't just save the row; this is the method that makes sure your geometric network is OK, that the ArcMap Editor gets notified that a row has changed, and that every relationship class involving this row is notified so it can validate the cardinality, etc. You should not be inserting new rows with this, you should be using an InsertCursor!
- Do you want (need) to do those inserts in an EditSession? It makes a huge difference if you do or not. Some operations require it (and make things slower), but when you don't need it, skip the undo/redo features.
- Cursors are expensive resources. Once you have a handle to one, you are guaranteed that you will have Consistency and Isolation and that has a cost.
- Cache other resources like database connections (don't create and destroy your Workspace reference) and Table handles (every time you open or close one - several metadata tables need to be read).
- Putting FeatureClasses inside or outside a FeatureDataset makes a huge difference in performance. It is not meant as an organizational feature!
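As promised in the batch-operations bullet above, here is the scripting-side analogue of GetRow vs GetRows: issuing one query (and one cursor, an expensive resource) per key inside a loop versus one query over the whole table cached into a dictionary. Everything named here is hypothetical:

```python
import arcgisscripting

gp = arcgisscripting.create(9.3)

table = r"C:\data\output.gdb\Attributes"   # hypothetical lookup table
node_ids = range(1, 10001)

# Anti-pattern: one query - and one cursor - per key.
labels = {}
for node_id in node_ids:
    rows = gp.SearchCursor(table, '"NodeID" = %d' % node_id)
    row = rows.Next()
    if row:
        labels[node_id] = row.GetValue("Label")
    del rows

# Better: one cursor over the whole table, cached into a dict in a single pass.
labels = {}
rows = gp.SearchCursor(table)
row = rows.Next()
while row:
    labels[row.GetValue("NodeID")] = row.GetValue("Label")
    row = rows.Next()
del rows
```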
5. And last but not least...
Understand the difference between I/O bound and CPU bound operations
I honestly thought about expanding more on every single one of those items and perhaps doing a series of blog entries that covers every single one of those topics, but my calendar's backlog list just slapped me in the face and started yelling at me.
My two cents.