I'm working on a program that determines the closest location to a given point. The point cloud I'm testing against is very big (~800,000 points). Still, that alone doesn't explain why my implementation is so slow. This is my approach:
First, I created a spatial index for the point shapefile:
pntshp.ExecuteSQL('CREATE SPATIAL INDEX ON %s' % table_name)
I defined an array of buffer distances to narrow down the search radius, which of course also means that I have to create a buffer for each point (and that might be expensive).
BUFFER_DISTANCES = ( 0.001, 0.005, 0.01, 0.02, 0.05 ) # 100m, 500m, 1km, 2km, 5km
Then the buffer is used as a spatial filter:
node_lyr.SetSpatialFilter(buff)
If the filtered layer contains no features, the buffer distance is increased:
for buffer_d in BUFFER_DISTANCES:
    buffr = get_buffer(xy_street, buffer_d)
    ...
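For context, here is a minimal sketch of how that loop could fit together, assuming `node_lyr` is the point layer and `get_buffer` wraps `ogr.Geometry.Buffer` (the question doesn't show `get_buffer`, so its body and the helper name `filter_nodes` are illustrative assumptions):

from osgeo import ogr

def get_buffer(xy, dist):
    # Build a point geometry and buffer it by `dist` (in layer units).
    p = ogr.Geometry(ogr.wkbPoint)
    p.AddPoint(xy[0], xy[1])
    return p.Buffer(dist)

def filter_nodes(node_lyr, xy_street):
    # Try increasingly large buffers until the spatial filter yields features.
    for buffer_d in BUFFER_DISTANCES:
        buffr = get_buffer(xy_street, buffer_d)
        node_lyr.SetSpatialFilter(buffr)
        if node_lyr.GetFeatureCount() > 0:
            return buffr
        node_lyr.SetSpatialFilter(None)  # clear before trying the next distance
    return None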
Then I calculate the distance to each point returned by the spatial filter:
p = ogr.Geometry(ogr.wkbPoint)
p.AddPoint(xy[0], xy[1])

dist = []
for feat in node_lyr:
    geom = feat.GetGeometryRef()
    d = p.Distance(geom)
    dist.append(d)
To get the closest point:
def get_closest_pnt(dist, node, how_close):
    mrg = zip(dist, node)
    mrg.sort(key=lambda t: t[0])
    try:
        return mrg[how_close]
    except IndexError, ierr:
        print '%s \ndist/node tuple contain %s' % (ierr, mrg)
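A usage example, assuming `node` is the list of features/points corresponding to `dist` (it is built alongside the distance loop above but not shown in the question):

# how_close=0 gives the nearest (distance, node) pair, 1 the second nearest, etc.
closest_dist, closest_node = get_closest_pnt(dist, node, 0)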
It all works fine, but it is really slow. Creating a spatial index didn't have any noticeable effect. Processing 100 points takes ~6.7 seconds with this implementation, and the program needs to find the closest location for more than 2,000 points as fast as possible. Any ideas on how to improve my approach?
EDIT
I tried different approaches to see where they would get me, and came across something very surprising that I want to share here.
I implemented a simple lookup algorithm as described here, as well as one of the suggested solutions (the sorted set approach).
The surprising fact is that performance depends not only on the implementation but even more on the operating system. My original ogr/buffer algorithm turns out to be blazingly fast on my OS X machine, whereas it is painstakingly slow on Linux (hence the question here).
Here are my results (100 runs).
Method      | OS X           | Linux (Ubuntu)
------------|----------------|---------------
ogr buffer  | 0:00:01.434389 | 0:01:08.384309
sub string  | 0:00:19.714432 | 0:00:10.048649
sorted set  | 0:00:01.239999 | 0:00:00.600773
Specs Mac OS X
Processor: 4 x 2.5 GHz
Memory: 8 GB 1600 MHz
Specs Dell Linux Ubuntu
Processor: 8 x 3.4 GHz
Memory: 7.8 GB
If someone can explain why these differences occur, please don't hesitate to share.
Answer
Avoiding the spatial query
Since you noted buffering is computationally expensive and may be holding you back, consider this approach: loop through each point and round off your lat/long to a decimal place within your buffer distance (i.e. if your lat/long is 12.3456789/12.3456789, then get all points whose lat/long begins with 12.34567/12.34567, 12.34568/12.34568, 12.34567/12.34568 or 12.34568/12.34567). Use a hash table to do this. Take this subset of points, compute all distances to your input point, and the point with the minimum distance is the one you want. Creating a lookup methodology like this will make the search very efficient.
This avoids having to do expensive spatial queries and query-filter setup 800,000 times; you would only be doing string/double comparisons and distance calculations. The only downside I can see is that each decimal roundoff is an order of magnitude above the last, so if your lookup didn't return any points, rounding down again may return many more points than you need, which would slow you down a bit. However, you have at least two orders of magnitude in your original BUFFER_DISTANCES, so I think this method should suffice for your purposes and would certainly be faster than the method you have right now.
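A minimal sketch of that idea, assuming the points are available as (x, y) tuples. The helper names (`bucket_key`, `build_index`, `closest_point`) and the choice of 3 decimal places (roughly matching the smallest 0.001-degree buffer) are illustrative, not part of the original post:

import math
from collections import defaultdict

PRECISION = 3  # ~0.001 degrees per bucket, matching the smallest buffer distance

def bucket_key(x, y, precision=PRECISION):
    # Round the coordinates so nearby points share the same dictionary key.
    return (round(x, precision), round(y, precision))

def build_index(points, precision=PRECISION):
    # points: iterable of (x, y) tuples -> hash table mapping bucket -> list of points
    index = defaultdict(list)
    for x, y in points:
        index[bucket_key(x, y, precision)].append((x, y))
    return index

def closest_point(index, x, y, precision=PRECISION):
    # Gather candidates from the point's own bucket and its 8 neighbours.
    step = 10 ** -precision
    candidates = []
    for dx in (-step, 0.0, step):
        for dy in (-step, 0.0, step):
            candidates.extend(index.get(bucket_key(x + dx, y + dy, precision), []))
    if not candidates:
        return None  # caller could retry with a coarser precision here
    return min(candidates, key=lambda p: math.hypot(p[0] - x, p[1] - y))

Building the index is a single pass over the 800,000 points; each subsequent query only touches the handful of buckets around the input point instead of the whole cloud.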
The hash table:
Here's a concise introductory explanation of hash tables.
The concept is like so: You want to look up the definition for the word "Yes" in the dictionary. You don't look through every word in a dictionary starting from A to find the words that start with Ye, correct? You jump straight to the Y section first, then you look for the page that says Ye-Yo and then you scan all the words on that page to get the definition for Yes.
The lookup methodology over all the lat/long points would be implemented in this same fashion. You'd look first for all the lat/longs that start with 12.3 in a range from, let's say, 0 to 99. Then you'd look in those 10 values for 12.34, and so on. If programmed correctly, you can return a "bucket" with all of the points within your buffer, without having to execute a single spatial query or string/double comparison!
Finally, it should be noted that if you store your indexed table in an RDBMS, there may already be an optimization for this. If your lat/long values are stored as doubles and you run a simple BETWEEN SQL query, the database's search will likely already be optimized to do this (provided your query is written correctly).
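Sticking with Python/OGR from the question, such a range query might look roughly like this. This is a sketch only: the `nodes` table and the `lon`/`lat` column names are hypothetical, it assumes the data source's SQL dialect supports BETWEEN, and the ±0.001 window matches the smallest buffer distance:

# Hypothetical table/column names; x, y is the query point.
sql = (
    "SELECT * FROM nodes "
    "WHERE lon BETWEEN %f AND %f AND lat BETWEEN %f AND %f"
    % (x - 0.001, x + 0.001, y - 0.001, y + 0.001)
)
result_lyr = pntshp.ExecuteSQL(sql)  # pntshp is the data source from the question
# ... compute distances over result_lyr as before ...
pntshp.ReleaseResultSet(result_lyr)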