Wednesday 20 February 2019

python - Speed up row-wise point in polygon with Geopandas


I have a geopandas GeoDataframe which contains some attributes and a geometry column which is filled with shapely Point(lon, lat) objects. Moreover, there is a GeoSeries which is equally indexed as the GeoDataframe and which contains shapely Polygon() objects. The polygons are derived from the commonly known world borders shapefile which means that many of them are rather large and complicated.


My goal is to join both GeoDataframes on a common attribute and then do a point in polygon test for each feature-pair, giving back all rows from the Points GeoDataframe for which the result is true. I think my approach given below works, however, it is quite slow and I wonder if there is any way to speed this up significantly.


In order to test this under similar conditions, you could download the World Country Points shapefile from the bottom of this page to get approx. 180.000 points as well as World Borders shapefile to get the polygons. Save them in the same folder as your Python script, then load them into GeoDataFrames with


import geopandas as gpd

print('Reading points...')
PointsGeodataframe = gpd.GeoDataFrame.from_file('Global_Points_2013March12_HIU_USDoS.shp')
print('Reading polygons...')

PolygonsGeodataframe = gpd.GeoDataFrame.from_file('TM_WORLD_BORDERS-0.3.shp')

At first, I merge both dataframes into one while joining on the columns storing the two digit ISO 3166-1 alpha-2 codes, keeping keys from the Points GeoDataframe.


print('Merging GeoDataframes...')
merged = PointsGeodataframe.merge(PolygonsGeodataframe, left_on='iso_alpha2', right_on='ISO2', how='left')
print(merged.head(5))

Out:


CC   NAME_x REGION_x                                    geometry_x  \
0 AG ALGERIA AFRICA POINT (-2.056197991715294 35.07505011900781)

1 AG ALGERIA AFRICA POINT (-1.93859720224998 35.08791119378083)
2 AG ALGERIA AFRICA POINT (-2.116729200249893 35.09038023113033)
3 AG ALGERIA AFRICA POINT (-2.136513829499904 35.09662084358041)
4 AG ALGERIA AFRICA POINT (-2.169619420782624 35.10288084768303)

iso_alpha2 iso_alpha3 iso_num tld AREA FIPS ISO2 ISO3 LAT LON \
0 DZ DZA 12 .dz 238174 AG DZ DZA 28.163 2.632
1 DZ DZA 12 .dz 238174 AG DZ DZA 28.163 2.632
2 DZ DZA 12 .dz 238174 AG DZ DZA 28.163 2.632
3 DZ DZA 12 .dz 238174 AG DZ DZA 28.163 2.632

4 DZ DZA 12 .dz 238174 AG DZ DZA 28.163 2.632

NAME_y POP2005 REGION_y SUBREGION UN \
0 Algeria 32854159 2 15 12
1 Algeria 32854159 2 15 12
2 Algeria 32854159 2 15 12
3 Algeria 32854159 2 15 12
4 Algeria 32854159 2 15 12

geometry_y

0 POLYGON ((2.96361 36.802216, 2.981389 36.80693...
1 POLYGON ((2.96361 36.802216, 2.981389 36.80693...
2 POLYGON ((2.96361 36.802216, 2.981389 36.80693...
3 POLYGON ((2.96361 36.802216, 2.981389 36.80693...
4 POLYGON ((2.96361 36.802216, 2.981389 36.80693...

To do a row-wise check if the Point (stored in geometry_x) is located within the polygon (stored in geometry_y) and get back all rows where this is the case, boolean indexing can be used (the geometry_x and geometry_y columns must be transformed into GeoSeries in order to use shapely's within on them:


rowsWherePointIsInPolygon = merged[gpd.GeoSeries(merged['geometry_x']).within(gpd.GeoSeries(merged['geometry_y']))]
print(rowsWherePointIsInPolygon)


This works fine for a small amount of rows. However, when doing this for the example shapefiles, it gets stuck for about half an hour on a rather fast machine and even longer on slow ones.


Is there any possibility to speed up this operation or is it already as fast as it gets? I don't think that there is any sense here to use a spatial index because the problem is not to find out to which of 400.000 polygons one point belongs to but to check if one point is in one polygon for a lot of points.


Without really knowing if this has something to do with it, I also tried setting


from shapely import speedups
speedups.enable()

but this did not make any change.




No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...