Friday, 2 March 2018

versioning - Implementing version control system for geospatial data?



Not that I'm in any immediate need of a definitive answer here, but I've lately seen some efforts to introduce the concept of "(distributed) version control systems" for geographic data. Some examples (that I know of) are the three whitepapers from OpenGeo (1, 2 & 3) and the "Geosynkronisering (geosynchronization)" project by Norwegian GIS software vendors and the Norwegian Mapping Agency. I've also found Distributed versioning of geospatial data?, which mentions GeoGit (by OpenGeo), and Applying version control to ArcGIS ModelBuilder models? about version control in ArcGIS.


Being a developer, I know (at least well enough to use them) how version control systems for source code (like SVN and Git) work, and my background in geomatics tells me that geographical data poses some unique challenges, so the approach can't be quite the same as for source code (which is basically text).


What are the challenges of (distributed) VCSs for geographical data, how would you solve them, do we need them at all, and are there other attempts to solve these issues besides the ones I have mentioned?


I know that the OpenGeo whitepapers will answer some of my questions, but what I'm really after is a more "pedagogical" answer, in the style of "tell me like I'm a 10-year-old", so that I can refer people to a great explanation of the challenges and solutions that geographical data brings to the mix.



I hope that someone with some insight will take the time to share their thoughts on the matter. As I said, I'm not currently trying to solve a particular problem, but this topic is one that interests me.



Answer



We are currently working on a complete redesign of our geodata stores, whose evolution spans more than 20 years by now. We identified the following key requirements for geospatial data management (a small transaction sketch follows the list):



  • simultaneous editing

  • permissions to read or write portions of data

  • hot updates while services that rely on the data keep running (transactions and the ACID paradigm)

  • internal and external schema (modifying the internal schema should not affect services)

  • ability to store and access large amounts of data (terabytes of raster and hundreds of gigabytes of vector data)

  • data consistency between different layers (every parcel belongs to a district and so on)
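
To make the transaction and consistency points concrete, here is a minimal sketch (not our production code) of an ACID edit against a hypothetical PostGIS database with made-up `parcels` and `districts` tables; services reading the table keep seeing the last committed state until the transaction commits:

```python
# Minimal sketch, assuming a PostGIS database and hypothetical parcels/districts tables.
# Permissions would be ordinary SQL, e.g. GRANT SELECT ON parcels TO readonly_role;
import psycopg2

conn = psycopg2.connect("dbname=geodata user=editor")  # placeholder connection details
try:
    with conn:                       # commit on success, roll back on any exception
        with conn.cursor() as cur:
            # hot update: readers are not blocked and keep seeing the old committed state
            cur.execute(
                "UPDATE parcels SET geom = ST_Multi(ST_MakeValid(geom)) "
                "WHERE district_id = %s",
                (42,),
            )
            # cross-layer consistency: every parcel must reference an existing district
            cur.execute(
                "SELECT count(*) FROM parcels p "
                "LEFT JOIN districts d ON d.id = p.district_id "
                "WHERE d.id IS NULL"
            )
            if cur.fetchone()[0]:
                raise RuntimeError("orphaned parcels found - rolling back")
finally:
    conn.close()
```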



We evaluated the following approaches; here is what I can say about them:




  1. ESRI Enterprise Geodatabase (ArcGIS 10.1): pretty much the same as what we had before (SDE), but with extensive use of the versioning feature to handle transactions. But it simply isn't really an enterprise geodatabase; in my opinion SDE only works as a geodata server in a workgroup where people work from 8:00 am to 8:00 pm and you can take it offline afterwards for maintenance tasks, transaction committing (versioning reconcile and post in ESRI terms, sketched below), replication, and so on. If you build services on top of this data, you have to maintain a staging database (where the work is done) and a replicated production database. This is pretty much the same as build/test and deploy in programming. While the feature-rich package ESRI delivers is quite nice, it lacks flexibility (schema changes or maintenance tasks while people are working, index creation, for example). In my opinion it blocks too many things that make sense to me as an administrator.
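
For illustration, a rough sketch of what such a nightly reconcile-and-post maintenance job can look like with arcpy; the connection file, version names and parameter choices are placeholders, not our actual setup:

```python
# Rough sketch of an off-hours "reconcile and post" run, assuming an .sde admin
# connection file; all paths and parameter choices are illustrative only.
import arcpy

sde = r"C:\connections\production.sde"            # hypothetical connection file
versions = [v for v in arcpy.ListVersions(sde) if v.upper() != "SDE.DEFAULT"]

arcpy.ReconcileVersions_management(
    sde,
    "ALL_VERSIONS",          # reconcile every edit version
    "sde.DEFAULT",           # target version to reconcile against and post into
    versions,
    "LOCK_ACQUIRED",
    "NO_ABORT",
    "BY_OBJECT",
    "FAVOR_TARGET_VERSION",  # automatic conflict resolution policy
    "POST",                  # post the reconciled edits to DEFAULT
    "DELETE_VERSION",        # clean up the edit versions afterwards
    r"C:\logs\reconcile_log.txt",
)
```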




  2. Flat files and a version control system: we chose Git (I didn't know there was already a GeoGit). Oh yes, some of my pals and I also come from software engineering. It could all be so simple, and I think that is exactly its problem: it's like a car mechanic building a car. It will be simple for him to maintain, but it will also be annoying to drive and ugly to look at, for sure. I think it also has some major disadvantages: how do you version-control a 2-terabyte (or even larger, binary) raster dataset, and in which format? Vector data can easily be version-controlled if you use text-based formats (GML, for example; see the sketch below), but it is also hard to work with a billion-row dataset that way. I am still not sure whether we can do effective user permission management, because not everybody should be allowed to edit or even view everything. And how do you merge a vector dataset that was edited intensively by four users at the same time? At the very least you have to be a real computer scientist/programmer to do all of this effectively; our GIS users are planners, surveyors, geologists, and so on. It is simply a problem for them to think in version lineages the way programmers do, or to use branching the way it is supposed to be used. Nevertheless, thinking of data stores as shared repositories is an interesting idea.
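
As an illustration of the flat-file idea, here is a small sketch (file names are made up) that rewrites a GeoJSON FeatureCollection as one canonically serialized feature per line; something like this is roughly what you need before Git diffs and merges become meaningful at the feature level:

```python
# Sketch of "flat files under Git" for vector data: one deterministically serialized
# feature per line, keyed by a stable ID, so diffs only touch the features that changed.
import json

with open("parcels.geojson") as f:          # hypothetical input dataset
    fc = json.load(f)

features = sorted(fc["features"], key=lambda feat: feat["properties"]["parcel_id"])

with open("parcels.geojsonl", "w") as out:
    for feat in features:
        # sort_keys plus fixed separators gives a stable one-line serialization,
        # which is what makes textual diffs meaningful at all
        out.write(json.dumps(feat, sort_keys=True, separators=(",", ":")) + "\n")

# The 2 TB binary raster problem has no equivalent trick: Git would store a
# whole new blob for every edit.
```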





  3. A flat-table database as a simple container. The same as SDE does it, but without the SDE layer on top. Still hard to maintain, because you do not actually make use of the advantages an RDBMS offers you. Yes, it is very simple to just load everything into a database (as the sketch below shows), but that is not data management at all.
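
A deliberately bare sketch of what "just load everything into a database" tends to look like (table and connection details are invented); note everything that is missing: geometry type, SRID, constraints, indexes, permissions:

```python
# The database used as a glorified file share: one wide table, geometry as plain text.
import psycopg2

conn = psycopg2.connect("dbname=geodata user=loader")   # placeholder connection
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS dump_layer (
            id        serial PRIMARY KEY,
            layer     text,
            attrs     text,      -- attributes as an opaque string
            geom_wkt  text       -- no geometry type, no SRID, no spatial index
        )
    """)
    cur.execute(
        "INSERT INTO dump_layer (layer, attrs, geom_wkt) VALUES (%s, %s, %s)",
        ("parcels", '{"owner": "unknown"}', "POLYGON((0 0,1 0,1 1,0 1,0 0))"),
    )
conn.close()
```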




  4. Big data and NoSQL. The same problems as flat files and flat tables. In my opinion it is essentially a simple file-system API for use on the web. Yes, it works well on the web, and yes, it is easy to just throw your documents in (see the sketch below), but if I want to do spatial analysis on terabytes of (possibly raster) data, I would rather not have it serialized and deserialized over an HTTP interface.
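
For illustration, the document-over-HTTP pattern in a few lines, against a purely hypothetical endpoint; the point is that every read for analysis has to come back through the same JSON-over-HTTP (de)serialization:

```python
# Sketch of "throw your documents in over HTTP"; the endpoint is entirely made up.
import json
import requests

feature = {
    "type": "Feature",
    "properties": {"parcel_id": "4711", "district": "A"},
    "geometry": {"type": "Point", "coordinates": [16.37, 48.21]},
}

# store one feature as one document
requests.put(
    "https://docstore.example.org/geodata/parcels/4711",
    data=json.dumps(feature),
    headers={"Content-Type": "application/json"},
    timeout=10,
)

# reading it back for analysis means re-parsing JSON for every single feature
resp = requests.get("https://docstore.example.org/geodata/parcels/4711", timeout=10)
feature_again = resp.json()
```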




UPDATE 2018: There is a lot of new stuff creating a lot of momentum here. To name a few:



  • Cloud block storage and HDFS

  • Python/Shapely/Dask stack (see the sketch after this list)



  • Apache Spark



    • GeoMesa/GeoWave for vector data

    • GeoTrellis for raster data




  • and much more
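
As a small taste of the Python/Shapely/Dask combination mentioned above (data and partition count are made up), partitioned per-feature geometry work looks roughly like this:

```python
# Minimal sketch: partition point coordinates with dask.bag and buffer them with Shapely.
import dask.bag as db
from shapely.geometry import Point

coords = [(x * 0.01, x * 0.02) for x in range(100_000)]   # fake point coordinates

total_buffered_area = (
    db.from_sequence(coords, npartitions=8)
      .map(lambda xy: Point(xy).buffer(50.0).area)        # per-feature geometry work
      .sum()
      .compute()
)
print(total_buffered_area)
```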




  5. Comprehensive classic database modelling (with an RDBMS). The main problem is that it is hard to simply drop data anywhere and then hope it fits every future need. But if you invest the time to specify a robust data model in a database (OSM in fact did this too), you can make use of all its advantages. We can edit and update data in distributed transactions, we can modify the core schema while services still rely on external schemas of the same data, we can maintain it, we can check its consistency, we can grant and deny permissions, we can store very large amounts of data while still accessing it quickly, and we can build historising (history-keeping) data models that are filled transparently via triggers, and so on (a sketch of the schema and history ideas follows). Because we use SQL Server we still lack a native raster type, but other database vendors already offer one.
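
Here is a sketch of the internal/external schema and history-trigger ideas from point 5; I wrote it against PostgreSQL/PostGIS (11+) for compactness rather than the SQL Server we actually use, and all schema, table, view and trigger names are invented:

```python
# Sketch only: internal table, external view for services, and a history trigger.
# Assumes PostgreSQL 11+ with the PostGIS extension installed.
import psycopg2

DDL = """
CREATE SCHEMA IF NOT EXISTS core;

CREATE TABLE IF NOT EXISTS core.districts (
    id bigint PRIMARY KEY
);

-- internal schema: the table services never touch directly
CREATE TABLE IF NOT EXISTS core.parcels (
    id          bigint PRIMARY KEY,
    district_id bigint NOT NULL REFERENCES core.districts(id),
    valid_from  timestamptz NOT NULL DEFAULT now(),
    geom        geometry(MultiPolygon, 25832) NOT NULL   -- needs PostGIS
);

-- external schema: services bind to this view, so the core table can change underneath it
CREATE OR REPLACE VIEW public.parcels_v AS
    SELECT id, district_id, geom FROM core.parcels;

-- history: every change is copied to an archive table by a trigger
CREATE TABLE IF NOT EXISTS core.parcels_history (LIKE core.parcels);

CREATE OR REPLACE FUNCTION core.parcels_archive() RETURNS trigger AS $$
BEGIN
    INSERT INTO core.parcels_history SELECT OLD.*;
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS parcels_archive_trg ON core.parcels;
CREATE TRIGGER parcels_archive_trg
    BEFORE UPDATE OR DELETE ON core.parcels
    FOR EACH ROW EXECUTE FUNCTION core.parcels_archive();
"""

with psycopg2.connect("dbname=geodata user=admin") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```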




Well, I think the relational database model has only caught up with the spatial world through spatial data types in the last couple of years (before that it was BLOB containers), and it is still the most flexible and professional form of storing data. That doesn't mean it should not be supplemented with VCS approaches or NoSQL, but I see those approaches more as a form of data distribution among groups of users than as a form of professional, centralized spatial data management. OSM has also centralized a lot of tasks that the crowd simply cannot provide, such as importing large amounts of data (most OSM data in Austria was imported in a day, not crowdsourced) and tile generation. The collaborative (crowdsourcing) part is indeed very important, but it is only half the business.


I think I have to rephrase a lot of this and provide more facts. A question like this is hard to answer comprehensively in a couple of hours; I will try to improve the quality of my answer over the next few days.

