Tuesday 29 May 2018

database - Is there any way I can use a Key-Value store for geospatial data?



I have used many relational databases in the past, but I have also been reading about NoSQL databases, and Key-Value stores look interesting.


When I store geometric objects I mostly use five indexed columns: ID, MIN_X, MAX_X, MIN_Y and MAX_Y (where X and Y are in a map projection). I don't need an index on my other data.


I need the X and Y values to look up objects in a specified place (a map rectangle), and I need the ID value if I want to update a specific object.
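For context, the lookup described above reduces to a bounding-box intersection test against the four coordinate columns. This is a minimal sketch (the function and data names are mine, not from the question) of what that test looks like:

```python
# Hypothetical sketch: the bounding-box overlap test behind the five-column
# schema (ID, MIN_X, MAX_X, MIN_Y, MAX_Y). Two boxes intersect when neither
# lies entirely to one side of the other.

def bbox_intersects(a, b):
    """a and b are (min_x, min_y, max_x, max_y) tuples in map units."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Toy feature table keyed by ID, mirroring the indexed columns:
features = {
    1: (0.0, 0.0, 10.0, 10.0),
    2: (20.0, 20.0, 30.0, 30.0),
}
query = (5.0, 5.0, 25.0, 25.0)  # the map rectangle being searched
hits = [fid for fid, box in features.items() if bbox_intersects(box, query)]
```

In a relational database the four range conditions are served by the indexes; the question is how a Key-Value store can serve the same predicate.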


Is there any way that I can use a Key-Value store for this?



Answer



We use Google App Engine to run spatial/attribute queries, and the main issue (from day one) has been how to index large sets of arbitrarily sized lines/polygons. Point data isn't too difficult (see geohash, geomodel, etc.), but sets of randomly clustered small and large polygons have always been a problem (and in some cases, still are).


I've tried several different versions of spatial indexing on GAE, but most are just variants of the two below. None were as fast as SQL databases, and all have pros and cons, but the tradeoffs seem reasonable for most internet-based mapping apps. Also, the two below need to be coupled with in-memory geometry culling (via JTS, etc.) to remove any features that don't fit the final search parameters. Finally, they rely on GAE-specific features, but I'm sure the approach could be applied to other architectures (or use TyphoonAE to run on a Linux cluster, EC2, etc.).


Grids - Pack all the features for a certain area into a known grid index. Place a small spatial index on the grid so you can quickly navigate the set of features it contains. For most queries you'll only need to pull a handful of grids, which is fast since you know the exact grid naming convention and how it relates to K/V entities (gets, not queries).


Pros - pretty fast, easy to implement, no memory footprint.


Cons - preprocessing needed, the user needs to decide the grid size, large geometries are shared across several grids, clustering can cause grids to become overloaded, and serialization/deserialization costs can be an issue (even when compressed via protocol buffers).



QuadKeys - This is the current implementation. Basically it's the same as Grids except there is no set grid level. As features are added, they are indexed by the quadkey grid that completely contains their bounds (or, in some cases, split in two when a single quadkey can't be used - think dateline). After the qk is found, it's then split into a maximum number of smaller qks that provide finer-grained representations of the feature. A pointer/bbox for that feature is then packed into a lightweight grid index (a group of features) that can be queried (an original design queried the features directly, but this proved too slow/CPU-intensive when the result set was large).


Polyline Quadkeys: http://www.arc2earth.com/images/help/GAE_QKS_1.png
Polygon Quadkeys: http://www.arc2earth.com/images/help/GAE_QKS_2.png


The quadkey naming convention used above is well known and, more importantly, tends to preserve locality.


The polygon above looks something like this: 0320101013123 03201010131212 03201010131213 0320101013132 0320101013133 03201010131302 03201010131303 032010101313002 032010101313003 032010101313012 032010101313013 03201010131312 03201010131313 032010101313102 ...
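The naming convention behind keys like those is the standard Bing Maps tile scheme: each zoom level appends one digit 0-3, so a tile's key is a prefix of all of its children's keys, which is exactly what preserves locality. A minimal sketch of the encoding:

```python
# Sketch of the quadkey naming convention (Bing Maps tile scheme): at each
# level, one digit 0-3 encodes which quadrant the tile falls in, built from
# one bit of the tile's x coordinate and one bit of its y coordinate.

def tile_to_quadkey(tx, ty, level):
    """Convert tile coordinates (tx, ty) at the given zoom level to a quadkey."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tx & mask:
            digit += 1
        if ty & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

parent = tile_to_quadkey(3, 5, 3)
child = tile_to_quadkey(6, 10, 4)  # the child's key extends the parent's key
```

Because a child tile's quadkey always starts with its parent's quadkey, nearby tiles share long prefixes, and prefix queries over the keys correspond to spatial containment.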


If the query bounds are small enough, you can fetch directly via the qk. This is optimal since it's only a single batch RPC call to the GAE datastore. If the bounds are large enough that they include too many possible qks (>1000), then you can instead query using a filter (e.g. qk >= "0320101013" and qk <= "0320101013" + u"\ufffd"). The quadkey naming convention plus the way GAE indexes strings allow the query above to fetch only the existing grids that fall under that qk prefix.
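The trick in that filter is that the datastore indexes strings lexicographically, so everything under a prefix sorts between the prefix itself and the prefix with a high sentinel character appended. A sketch against a plain sorted list (on GAE this becomes two inequality filters on the qk property; the function name is mine):

```python
# Hypothetical sketch of the prefix-range trick: all quadkeys that extend a
# given prefix sort lexicographically between `prefix` and
# `prefix + u"\ufffd"` (a high code point used as a sentinel), so a prefix
# scan is just a range lookup over the sorted keys.

import bisect

def quadkeys_under(prefix, sorted_keys):
    """Return every key in sorted_keys that starts with prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_right(sorted_keys, prefix + u"\ufffd")
    return sorted_keys[lo:hi]

keys = sorted(["0320101013123", "03201010131212", "0321", "0300"])
matches = quadkeys_under("0320101013", keys)  # excludes "0321" and "0300"
```

The same two-comparison range works in any store that keeps keys in sorted order, which is why the approach ports beyond GAE.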


There are other caveats and perf issues, but in general it's the ability to query on the quadkeys that makes this feasible.


Examples - query on US counties: geojson


Pros - pretty fast, no grid size config, no memory footprint, no overcrowded grids


Cons - preprocessing needed, possible overfetch in some scenarios, no polar data


Space Filling Curves - Take a look at Alfred's Next Gen Queries talk from Google I/O this year. The inclusion of generic space/time-filling curves along with the new MultiQuery operators (run in parallel) will allow for some really cool spatial queries. Will it beat traditional SQL performance? Hard to say, but it should scale really well. And we're rapidly approaching a future where always-on mobile devices of all shapes and sizes will dramatically ramp up the traffic to your site/service.
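The classic space-filling-curve index the talk builds on is the Morton (Z-order) curve: interleave the bits of x and y into a single key so 2D proximity mostly maps to 1D key proximity. A minimal sketch (not the talk's actual implementation):

```python
# Hypothetical sketch of a space-filling-curve key: interleave the low bits
# of x and y into a single Morton (Z-order) value. Points that are close in
# 2D tend to be close in the 1D key order, so a box query can be answered
# with a small set of key ranges.

def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y (x in even bit positions)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key
```

Quadkeys are actually the base-4 digits of this same interleaving, which is why both schemes preserve locality in the same way.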



Finally, I would also agree that you should look very closely at your problem domain before choosing NoSQL over SQL. In our case, I really liked the pricing model of GAE, so there really wasn't a choice; but if you don't need to scale, save yourself some time and just use a standard SQL DB.

