Thursday 16 June 2016

Using geohash for proximity searches?


I'm looking to optimize the query time of point-proximity geo searches.


My input is a lat,lng point and I'm searching a precomputed set of locations for the n nearest points.


I don't care how much time/space building the precomputed index of locations will take, but I do care that the queries are super fast.


I'm thinking about using the geohash as the search key: first check whether I get results for the first X characters of the key, then keep trimming characters from the end of the key until results start to appear.
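
Roughly what I have in mind (just a sketch, assuming a geohash library such as ngeohash and an in-memory map from base32 geohash prefixes to points; the names are made up):

    // Illustrative only: indexByPrefix is a hypothetical Map from base32
    // geohash prefixes to arrays of precomputed points.
    const ngeohash = require('ngeohash'); // assumed geohash library

    function nearbyByPrefixTrim(lat, lng, indexByPrefix, maxChars = 9) {
      const full = ngeohash.encode(lat, lng, maxChars); // base32 geohash of the query point
      // Start with the most precise prefix and trim characters until something matches.
      for (let len = maxChars; len > 0; len--) {
        const hits = indexByPrefix.get(full.slice(0, len));
        if (hits && hits.length) return hits;
      }
      return [];
    }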


To my (for now, very sparse) understanding of geo-indexing techniques, this approach should produce the fastest results (in terms of query time) compared to all other known implementations (such as R-trees and co.).



Answer



Absolutely you can, and it can be quite fast. (The computationally intensive bits can ALSO be distributed.)



There are several ways, but the one I've been working with uses an ordered list of integer-based geohashes: find all the neighbouring geohash ranges for a specific geohash resolution (the resolution approximates your distance criteria), then query those ranges to get a list of nearby points. I use redis and nodejs (i.e. javascript) for this. Redis is super fast and can retrieve ordered ranges very quickly, but it can't do much of the index/query manipulation that SQL databases can do.
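
To give a concrete picture, loading the precomputed points might look roughly like this (a sketch only, assuming the ngeohash package for integer geohash functions and the ioredis client; the key and field names are illustrative):

    const Redis = require('ioredis');     // assumed redis client
    const ngeohash = require('ngeohash'); // assumed geohash library with *_int functions

    const redis = new Redis();
    const STORED_BIT_DEPTH = 52; // max safe integer precision in javascript

    // Index every precomputed location once, keyed by its integer geohash:
    // score = the 52-bit geohash integer, member = the point's id.
    async function indexPoints(points) {
      for (const p of points) {
        const hash = ngeohash.encode_int(p.lat, p.lng, STORED_BIT_DEPTH);
        await redis.zadd('points', hash, p.id);
      }
    }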


The method is outlined here: https://github.com/yinqiwen/ardb/wiki/Spatial-Index


But the gist of it is (to paraphrase the link; a code sketch of the whole pipeline follows the list):



  1. You store all your geohashed points at the best resolution you want (usually a 64-bit integer at most, if that's accessible, or in the case of javascript, 52 bits) in an ordered set (i.e. a zset in redis). Most geohash libraries these days have geohash integer functions built in, and you'll need to use these instead of the more common base32 geohashes.

  2. Based on the radius you want to search within, you need to then find a bit depth/resolution that will match your search area and this must be less than or equal to your stored geohash bit depth. The linked site has a table that correlates the bit depth of a geohash to its bounding box area in meters.

  3. Then you rehash your original coordinate at this lower resolution.

  4. At that lower resolution, also find the 8 neighbouring (n, ne, e, se, s, sw, w, nw) geohash areas. You have to use the neighbour method because two coordinates right beside each other can have completely different geohashes, so you need to pad out the area covered by the search.

  5. Once you have all the neighbour geohashes at this lower resolution, add your coordinate's geohash from step 3 to the list.

  6. Then you need to build the ranges of geohash values to search within, which cover these 9 areas. The values from step 5 are your lower range limits; add 1 to each of them to get the upper range limits. So you should have an array of 9 ranges, each with a lower and an upper geohash limit (18 geohashes in total). These geohashes are still at the lower resolution from step 2.


  7. Then you convert all 18 of these geohashes to whatever bit depth/resolution your stored geohashes use in the database. Generally you do this by bit-shifting them up to the stored bit depth.

  8. Now you can do a range query for points within these 9 ranges and you'll get all points approximately within the distance of your original point. There will be no overlap, so you don't need to do any intersections: just pure range queries, very fast. (i.e. in redis: ZRANGEBYSCORE zsetname lowerLimit upperLimit, over the 9 ranges produced above)
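
Pulling steps 2 to 8 together, a rough sketch (again assuming ngeohash's integer functions and an ioredis client; neighbors_int is assumed to return the 8 surrounding cell hashes, and 'points' is the zset from the earlier sketch):

    const ngeohash = require('ngeohash'); // assumed: provides encode_int / neighbors_int

    const STORED_BIT_DEPTH = 52; // the depth the points were stored at

    // Steps 2-7: build the 9 search ranges at the stored resolution.
    // rangeBitDepth comes from the bit-depth table on the linked wiki page
    // (pick the depth whose cell size best matches your search radius).
    function searchRanges(lat, lng, rangeBitDepth) {
      const centre = ngeohash.encode_int(lat, lng, rangeBitDepth); // step 3
      const cells = ngeohash.neighbors_int(centre, rangeBitDepth); // step 4 (assumed helper)
      cells.push(centre);                                          // step 5
      // Steps 6-7: [cell, cell + 1) at the lower depth, scaled up to the stored
      // depth. Multiply instead of << because these values don't fit in 32 bits.
      const scale = 2 ** (STORED_BIT_DEPTH - rangeBitDepth);
      return cells.map(c => [c * scale, (c + 1) * scale]);
    }

    // Step 8: one ZRANGEBYSCORE per range; the ranges don't overlap, so the
    // results need no deduplication.
    async function nearby(redis, lat, lng, rangeBitDepth) {
      const queries = searchRanges(lat, lng, rangeBitDepth).map(([lo, hi]) =>
        redis.zrangebyscore('points', lo, '(' + hi) // '(' makes the upper bound exclusive
      );
      return (await Promise.all(queries)).flat(); // ids of candidate points
    }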


You can further optimize this (speed-wise) by:



  1. Taking those 9 ranges from step 6 and finding where they run into each other. Usually you can reduce the 9 separate ranges to about 4 or 5, depending on where your coordinate is. This can cut your query time roughly in half. (See the sketch after this list.)

  2. Once you have your final ranges, hold on to them for reuse. Calculating these ranges can take most of the processing time, so if your original coordinate doesn't change much but you need to make the same distance query again, keep the ranges ready instead of recalculating them every time.

  3. If you're using redis, try to combine the queries into a MULTI/EXEC so it pipelines them for a bit better performance.

  4. The BEST part: you can distribute steps 2-7 to clients instead of having all that computation done in one place. This greatly reduces CPU load in situations where millions of requests would be coming in.
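
For example, optimisations 1 and 3 might look something like this (a sketch only; searchRanges is the hypothetical helper from the previous sketch and the client is assumed to be ioredis). Optimisation 2 then amounts to caching the merged ranges per coordinate/radius pair:

    // Optimisation 1: sort the ranges and merge any that are contiguous or
    // overlapping; the 9 ranges usually collapse to 4 or 5.
    function mergeRanges(ranges) {
      const sorted = [...ranges].sort((a, b) => a[0] - b[0]);
      const merged = [sorted[0].slice()];
      for (const [lo, hi] of sorted.slice(1)) {
        const last = merged[merged.length - 1];
        if (lo <= last[1]) last[1] = Math.max(last[1], hi); // touches the previous range: extend it
        else merged.push([lo, hi]);
      }
      return merged;
    }

    // Optimisation 3: send all the range queries in one MULTI/EXEC so redis
    // pipelines them instead of paying a round trip per range.
    async function nearbyPipelined(redis, ranges) {
      const multi = redis.multi();
      for (const [lo, hi] of ranges) multi.zrangebyscore('points', lo, '(' + hi);
      const replies = await multi.exec(); // ioredis returns an array of [err, result] pairs
      return replies.flatMap(([, members]) => members);
    }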



You can further improve accuracy by running a great-circle distance (haversine) function over the returned results, if you care about precision.
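
If precision matters, that post-filter is only a few lines (a sketch; it assumes each candidate's lat/lng is available, e.g. stored alongside its id):

    // Great-circle (haversine) distance in metres between two lat/lng points.
    function haversineMetres(lat1, lng1, lat2, lng2) {
      const toRad = d => (d * Math.PI) / 180;
      const R = 6371000; // mean earth radius in metres
      const dLat = toRad(lat2 - lat1);
      const dLng = toRad(lng2 - lng1);
      const a = Math.sin(dLat / 2) ** 2 +
                Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(a));
    }

    // Drop candidates from the range queries that fall outside the exact radius.
    function filterByRadius(points, lat, lng, radiusMetres) {
      return points.filter(p => haversineMetres(lat, lng, p.lat, p.lng) <= radiusMetres);
    }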


Here's a similar technique using ordinary base32 geohashes and a SQL query instead of redis: https://github.com/davetroy/geohash-js


I don't mean to plug my own thing, but I've written a module for node.js and redis that makes this really easy to implement. Have a look at the code if you'd like: https://github.com/arjunmehta/node-georedis

