Tuesday 27 September 2016

versioning - Managing large amounts of geospatial data?



How do you manage your geospatial data? I have terabytes of data spread out over hundreds of datasets, and an ad-hoc solution using symbolic links within projects that link back to a domain-name-based archive directory for each dataset. This mostly works, but has its own issues.
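The symlink scheme described above can be sketched in a few lines; the layout and names here (`archive/<domain>/<dataset>`, a `projects` tree) are assumptions for illustration, not the asker's actual paths:

```python
from pathlib import Path

def link_dataset(archive: Path, projects: Path,
                 project: str, domain: str, dataset: str) -> Path:
    """Symlink a project's view of a dataset back to the single
    authoritative copy in a domain-named archive directory."""
    target = archive / domain / dataset          # e.g. archive/gov.example/dem
    link = projects / project / dataset          # e.g. projects/flood_study/dem
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        link.symlink_to(target)
    return link
```

One authoritative copy per dataset, any number of project views; the trade-off is that links silently dangle if the archive is reorganised.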


I'm also keen to hear if anyone manages their geospatial data in a revision control system; I currently use one for my code and small datasets, but not for full datasets.



Answer



I think the stock/obvious answer would be to use a spatial database (PostGIS, Oracle, SDE, MSSQL Spatial, etc.) in conjunction with a metadata server such as Esri's GeoPortal or the open source GeoNetwork application, and overall I think this is generally the best solution. However, you'll likely always have a need for project-based snapshots / branches / tags. Some of the more advanced databases have ways of managing these, but they're generally not all that easy to use or manage.



For things you store outside of a database (large images, project-based files) I think the key is to have a consistent naming convention and again a metadata registry (even something low-tech like a spreadsheet) that allows you to track them and ensure that they are properly managed. For instance, in the case of project-based files this can mean deleting them when records management policy dictates, or rolling them into the central repository on project completion.
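Even the low-tech registry idea above benefits from a little structure. This is a minimal sketch, assuming a one-row-per-dataset CSV with hypothetical column names; the point is that a retention date recorded at registration time makes the records-management step mechanical later:

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical registry schema: enough to find a project file later
# and to know when policy says it can be deleted or rolled into
# the central repository.
FIELDS = ["name", "path", "project", "registered", "retain_until"]

def register(registry: Path, name: str, path: str,
             project: str, retain_until: str) -> None:
    """Append one dataset record; write the header on first use."""
    new = not registry.exists()
    with registry.open("a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=FIELDS)
        if new:
            w.writeheader()
        w.writerow({"name": name, "path": path, "project": project,
                    "registered": date.today().isoformat(),
                    "retain_until": retain_until})

def expired(registry: Path, today: str) -> list[str]:
    """Paths whose retention period has lapsed (ISO dates sort lexically)."""
    with registry.open(newline="") as f:
        return [r["path"] for r in csv.DictReader(f)
                if r["retain_until"] < today]
```

A spreadsheet works just as well for small shops; the CSV form only matters once you want to script the cleanup.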


I have seen some interesting solutions though...


Back when the BC Ministry of Environment was running things off of Arc/Info coverages, they had a really cool rsync-based two-way synchronization process in place. The coverages that were under central control were pushed out to regions nightly, and regional data was pushed back in. This block-level differential transfer worked really well, even over 56k links. There were similar processes for replicating the Oracle-based attribute databases, but I don't think they typically did too well over dial-up :)
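The push/pull pattern described above is easy to reproduce with rsync today. A minimal sketch, with hypothetical hostnames and paths (the originals aren't mine to share); each side is authoritative only for its own tree, which is what keeps a two-way sync from clobbering anything:

```python
import subprocess

CENTRAL = "/data/central/"    # hypothetical path: centrally-managed coverages
REGIONAL = "/data/regional/"  # hypothetical path: regionally-managed data

def push_cmd(host: str) -> list[str]:
    # Central copy is authoritative: mirror it out, deleting stale files.
    # -a preserves metadata, -z compresses (the win on slow links).
    return ["rsync", "-az", "--delete", CENTRAL, f"{host}:{CENTRAL}"]

def pull_cmd(host: str) -> list[str]:
    # Region is authoritative for its own tree: pull it back, no --delete.
    return ["rsync", "-az", f"{host}:{REGIONAL}", REGIONAL]

def nightly_sync(hosts: list[str]) -> None:
    """Run the two one-way transfers against each regional host."""
    for h in hosts:
        subprocess.run(push_cmd(h), check=True)
        subprocess.run(pull_cmd(h), check=True)
```

rsync's delta-transfer algorithm sends only the changed blocks of each file, which is why this was viable even at 56k.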


My current place of work uses a similar hybrid solution. Each dataset has its authoritative copy (some in Oracle, others in MapInfo, others in personal geodatabases) and these are cross-ETL'd nightly using FME. There is some pretty major overhead here when it comes to maintenance though; the effort to create any new dataset and ensure organisational visibility is considerably higher than it should be. We're in the process of a review intended to find some way of consolidating to avoid this overhead.

