Saturday 27 January 2018

Maximizing Code Performance for Shapely


I've written this code in order to calculate the percent cover of grassland within a national park, within 20km^2 of points of recorded occurrences for a species. It is designed to only consider the areas within the park, and not those outside. It works, except for one huge issue... it's incredibly slow. I think I've tracked it down to the avail_lc_type = pt_buffer.intersection(gl).area. The gl polygon has > 24,000 polygons in it. It is a raster to polygon transformation (it has been dissolved), so it has a lot of little polygons within. Right now it's not a huge problem since I'm running it on ~300 points (still takes > 1 hr), but I plan to run it on a few million later, so I need to improve it.


Any ideas?


import numpy as np
import pprint
import shapely
from shapely.geometry import*
import fiona

from fiona import collection
import math

traps = fiona.open('some_points.shp', 'r') #point file: focal points around which statistics are being derived

study_area = fiona.open('available_areas.shp', 'r') #polygon file: represents area available for analysis
for i in study_area: #for every record in 'study_area'
sa = shape(i['geometry']) #make a variable called 'sa' that is a polygon

grassland = fiona.open('land_cover_type_of_interest.shp', 'r') #polygon file: want to calculate percent cover of this lc type within study_area, and within areaKM2 (next variable) of each focal point

pol = grassland.next()
gl = MultiPolygon([shape(pol['geometry']) for pol in grassland])

areaKM2 = 20 #hyp home range size of specie of interest

with traps as input:
#calculate initial area in meters, set radius
areaM2 = areaKM2 * 1000000
r = (math.sqrt(areaM2/math.pi))
#begin buffering and calculating available area (i.e. within study area) for each point

for point in input:
pt_buffer = shape(point['geometry']).buffer(r)
avail_area = pt_buffer.intersection(sa).area
#check and adjust radius of buffer until it covers desired available area within study area
while avail_area < areaM2:
r += 300
pt_buffer = shape(point['geometry']).buffer(r)
avail_area = pt_buffer.intersection(sa).area
#then, calulate percent cover of land cover type of interest within adjusted buffer area
#print to check

avail_lc_type = pt_buffer.intersection(gl).area
perc_cov = (avail_lc_type/areaM2) * 100
print perc_cov



Using the answer from @MWrenn I was able to profile my code and came up with this:


55.3555590078
415 function calls (365 primitive calls) in 48.633 seconds

Ordered by: standard name


ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 48.632 48.632 :15(neighb_func)
1 0.000 0.000 48.633 48.633 :1()
2 0.000 0.000 0.001 0.000 :531(write)
1 0.000 0.000 0.000 0.000 __init__.py:1118(debug)
1 0.000 0.000 0.000 0.000 __init__.py:1318(getEffectiveLevel)
1 0.000 0.000 0.000 0.000 __init__.py:1332(isEnabledFor)
1 0.000 0.000 0.000 0.000 _abcoll.py:483(update)
3 0.000 0.000 0.000 0.000 _weakrefset.py:68(__contains__)

1 0.000 0.000 0.000 0.000 abc.py:128(__instancecheck__)
1 0.000 0.000 0.000 0.000 abc.py:148(__subclasscheck__)
11 0.000 0.000 0.000 0.000 base.py:195(_is_empty)
11 0.003 0.000 0.003 0.000 base.py:202(empty)
7 0.000 0.000 0.003 0.000 base.py:212(__del__)
25 0.000 0.000 0.000 0.000 base.py:231(_geom)
2 0.000 0.000 0.000 0.000 base.py:235(_geom)
3 0.000 0.000 0.000 0.000 base.py:313(geometryType)
3 0.000 0.000 0.000 0.000 base.py:316(type)
3 0.000 0.000 0.001 0.000 base.py:383(area)

2 0.000 0.000 0.001 0.000 base.py:443(buffer)
8 0.000 0.000 0.000 0.000 base.py:46(geometry_type_name)
5 0.000 0.000 0.000 0.000 base.py:52(geom_factory)
3 0.000 0.000 48.626 16.209 base.py:529(intersection)
10 0.000 0.000 0.000 0.000 brine.py:106(_dump_int)
2 0.000 0.000 0.000 0.000 brine.py:150(_dump_str)
12/2 0.000 0.000 0.000 0.000 brine.py:179(_dump_tuple)
24/2 0.000 0.000 0.000 0.000 brine.py:202(_dump)
2 0.000 0.000 0.000 0.000 brine.py:332(dump)
10/2 0.000 0.000 0.000 0.000 brine.py:360(dumpable)

14/8 0.000 0.000 0.000 0.000 brine.py:369()
2 0.000 0.000 0.000 0.000 channel.py:56(send)
1 0.000 0.000 0.000 0.000 collection.py:186(filter)
1 0.000 0.000 0.000 0.000 collection.py:274(__iter__)
1 0.000 0.000 0.000 0.000 collection.py:364(closed)
1 0.000 0.000 0.000 0.000 collections.py:37(__init__)
25 0.000 0.000 0.000 0.000 collections.py:53(__setitem__)
4 0.000 0.000 0.000 0.000 compat.py:17(BYTES_LITERAL)
2 0.000 0.000 0.000 0.000 coords.py:20(required)
2 0.000 0.000 0.000 0.000 geo.py:20(shape)

5 0.000 0.000 0.000 0.000 geos.py:484(errcheck_predicate)
8 0.000 0.000 0.000 0.000 impl.py:43(__getitem__)
2 0.000 0.000 0.000 0.000 point.py:124(_set_coords)
2 0.000 0.000 0.000 0.000 point.py:188(geos_point_from_py)
2 0.000 0.000 0.000 0.000 point.py:37(__init__)
2 0.000 0.000 0.001 0.000 protocol.py:220(_send)
2 0.000 0.000 0.001 0.000 protocol.py:227(_send_request)
2 0.000 0.000 0.000 0.000 protocol.py:241(_box)
2 0.000 0.000 0.001 0.000 protocol.py:438(_async_request)
2 0.000 0.000 0.000 0.000 stream.py:173(write)

11 0.000 0.000 0.000 0.000 topology.py:14(_validate)
3 0.000 0.000 0.000 0.000 topology.py:33(__call__)
3 48.626 16.209 48.626 16.209 topology.py:40(__call__)
2 0.001 0.000 0.001 0.000 topology.py:57(__call__)
5 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
5 0.000 0.000 0.000 0.000 {__import__}
5 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
3 0.000 0.000 0.000 0.000 {_ctypes.byref}
6/2 0.000 0.000 0.000 0.000 {all}
6 0.000 0.000 0.000 0.000 {getattr}

5 0.000 0.000 0.000 0.000 {globals}
10 0.000 0.000 0.000 0.000 {hasattr}
5 0.000 0.000 0.000 0.000 {isinstance}
31 0.000 0.000 0.000 0.000 {len}
5 0.000 0.000 0.000 0.000 {locals}
1 0.000 0.000 0.000 0.000 {math.sqrt}
2 0.000 0.000 0.000 0.000 {method 'acquire' of 'thread.lock' objects}
24 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
28 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}

1 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects}
5 0.000 0.000 0.000 0.000 {method 'pack' of 'Struct' objects}
2 0.000 0.000 0.000 0.000 {method 'release' of 'thread.lock' objects}
2 0.000 0.000 0.000 0.000 {method 'send' of '_socket.socket' objects}
2 0.000 0.000 0.000 0.000 {next}

I'm still trying to figure out exactly what it means, but I think it's telling me that the intersection function is taking ~16 seconds each time it's called. So, per @gene 's suggestion, I'm working on using a spatial index. I just have to figure out how to get libspatialindex going so that I can use Rtrees.



Answer




You need to use a spatial index. Without an spatial index, you must iterate through all the geometries. With a bounding spatial index, you iterate only through the geometries which have a chance to intersect the other geometries.


Popular bounding spatial indexes in Python:



Examples:



In your script:



  1. Why

    • from fiona import collectionif you import fiona and use only fiona?


    • import numpy as np if you are not using Numpy but math?

    • import shapely if you specify from shapely.geometry import *?



  2. If sa is a list, then use sa = [shape(i['geometry']) for i in for i in study_area]. If not, you need a condition; otherwise sa will be the last feature of study_area.

  3. Why pol = grassland.next() if you read the shapefile with MultiPolygon([shape(pol['geometry']) for pol in grassland])?


Some suggestions:


You use the grassland and trap variables once, so you don't need them


gl = MultiPolygon([shape(pol['geometry']) for pol in fiona.open('land_cover_type_of_interest.shp')]) 


and


for point in fiona.open('some_points.shp'):

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...