Wednesday 20 June 2018

python - geopandas spatial join extremely slow


I am using the code below to find a country (and sometimes state) for millions of GPS points. The code currently takes about one second per point, which is incredibly slow. The shapefile is 6 MB.


I read that geopandas uses rtrees for spatial joins, making them incredibly efficient, but this does not seem to work here. What am I doing wrong? I was hoping for a thousand points per second or so.


The shapefile and csv can be downloaded here (5MB): https://www.dropbox.com/s/gdkxtpqupj0sidm/SpatialJoin.zip?dl=0


import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame, read_file
from geopandas.tools import sjoin
from shapely.geometry import Point, mapping,shape
import time



#parameters
shapefile="K:/.../Shapefiles/Used/World.shp"
df=pd.read_csv("K:/.../output2.csv",index_col=None,nrows=20)# Limit to 20 rows for testing

if __name__=="__main__":
start=time.time()
df['geometry'] = df.apply(lambda z: Point(z.Longitude, z.Latitude), axis=1)
PointsGeodataframe = gpd.GeoDataFrame(df)

PolygonsGeodataframe = gpd.GeoDataFrame.from_file(shapefile)
PointsGeodataframe.crs = PolygonsGeodataframe.crs
print time.time()-start
merged=sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left')
print time.time()-start
merged.to_csv("K:/01. Personal/04. Models/10. Location/output.csv",index=None)
print time.time()-start

Answer



adding the argument op='within' in the sjoin function speeds up the point-in-polygon operation dramatically.


Default value is op='intersects', which I guess would also lead to correct result, but is 100 to 1000 times slower.



No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...