Maximizing Code Performance for Shapely

Saturday, 27 January 2018

Maximizing Code Performance for Shapely

I've written this code in order to calculate the percent cover of grassland within a national park, within 20km^2 of points of recorded occurrences for a species. It is designed to only consider the areas within the park, and not those outside. It works, except for one huge issue... it's incredibly slow. I think I've tracked it down to the avail_lc_type = pt_buffer.intersection(gl).area. The gl polygon has > 24,000 polygons in it. It is a raster to polygon transformation (it has been dissolved), so it has a lot of little polygons within. Right now it's not a huge problem since I'm running it on ~300 points (still takes > 1 hr), but I plan to run it on a few million later, so I need to improve it.

Any ideas?

import numpy as np
import pprint
import shapely
from shapely.geometry import*
import fiona

from fiona import collection
import math

traps = fiona.open('some_points.shp', 'r') #point file: focal points around which statistics are being derived

study_area = fiona.open('available_areas.shp', 'r') #polygon file: represents area available for analysis
for i in study_area: #for every record in 'study_area'
        sa = shape(i['geometry']) #make a variable called 'sa' that is a polygon

grassland = fiona.open('land_cover_type_of_interest.shp', 'r') #polygon file: want to calculate percent cover of this lc type within study_area, and within areaKM2 (next variable) of each focal point

pol = grassland.next()
gl = MultiPolygon([shape(pol['geometry']) for pol in grassland]) 

areaKM2 = 20 #hyp home range size of specie of interest

with traps as input:
    #calculate initial area in meters, set radius
    areaM2 = areaKM2 * 1000000
    r = (math.sqrt(areaM2/math.pi))
    #begin buffering and calculating available area (i.e. within study area) for each point

    for point in input:
        pt_buffer = shape(point['geometry']).buffer(r)
        avail_area = pt_buffer.intersection(sa).area
        #check and adjust radius of buffer until it covers desired available area within study area
        while avail_area < areaM2:
            r += 300
            pt_buffer = shape(point['geometry']).buffer(r)
            avail_area = pt_buffer.intersection(sa).area
        #then, calulate percent cover of land cover type of interest within adjusted buffer area
        #print to check

        avail_lc_type = pt_buffer.intersection(gl).area
        perc_cov = (avail_lc_type/areaM2) * 100
        print perc_cov

Using the answer from @MWrenn I was able to profile my code and came up with this:

55.3555590078
         415 function calls (365 primitive calls) in 48.633 seconds

   Ordered by: standard name


   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   48.632   48.632 :15(neighb_func)
        1    0.000    0.000   48.633   48.633 :1()
        2    0.000    0.000    0.001    0.000 :531(write)
        1    0.000    0.000    0.000    0.000 __init__.py:1118(debug)
        1    0.000    0.000    0.000    0.000 __init__.py:1318(getEffectiveLevel)
        1    0.000    0.000    0.000    0.000 __init__.py:1332(isEnabledFor)
        1    0.000    0.000    0.000    0.000 _abcoll.py:483(update)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:68(__contains__)

        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        1    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
       11    0.000    0.000    0.000    0.000 base.py:195(_is_empty)
       11    0.003    0.000    0.003    0.000 base.py:202(empty)
        7    0.000    0.000    0.003    0.000 base.py:212(__del__)
       25    0.000    0.000    0.000    0.000 base.py:231(_geom)
        2    0.000    0.000    0.000    0.000 base.py:235(_geom)
        3    0.000    0.000    0.000    0.000 base.py:313(geometryType)
        3    0.000    0.000    0.000    0.000 base.py:316(type)
        3    0.000    0.000    0.001    0.000 base.py:383(area)

        2    0.000    0.000    0.001    0.000 base.py:443(buffer)
        8    0.000    0.000    0.000    0.000 base.py:46(geometry_type_name)
        5    0.000    0.000    0.000    0.000 base.py:52(geom_factory)
        3    0.000    0.000   48.626   16.209 base.py:529(intersection)
       10    0.000    0.000    0.000    0.000 brine.py:106(_dump_int)
        2    0.000    0.000    0.000    0.000 brine.py:150(_dump_str)
     12/2    0.000    0.000    0.000    0.000 brine.py:179(_dump_tuple)
     24/2    0.000    0.000    0.000    0.000 brine.py:202(_dump)
        2    0.000    0.000    0.000    0.000 brine.py:332(dump)
     10/2    0.000    0.000    0.000    0.000 brine.py:360(dumpable)

     14/8    0.000    0.000    0.000    0.000 brine.py:369()
        2    0.000    0.000    0.000    0.000 channel.py:56(send)
        1    0.000    0.000    0.000    0.000 collection.py:186(filter)
        1    0.000    0.000    0.000    0.000 collection.py:274(__iter__)
        1    0.000    0.000    0.000    0.000 collection.py:364(closed)
        1    0.000    0.000    0.000    0.000 collections.py:37(__init__)
       25    0.000    0.000    0.000    0.000 collections.py:53(__setitem__)
        4    0.000    0.000    0.000    0.000 compat.py:17(BYTES_LITERAL)
        2    0.000    0.000    0.000    0.000 coords.py:20(required)
        2    0.000    0.000    0.000    0.000 geo.py:20(shape)

        5    0.000    0.000    0.000    0.000 geos.py:484(errcheck_predicate)
        8    0.000    0.000    0.000    0.000 impl.py:43(__getitem__)
        2    0.000    0.000    0.000    0.000 point.py:124(_set_coords)
        2    0.000    0.000    0.000    0.000 point.py:188(geos_point_from_py)
        2    0.000    0.000    0.000    0.000 point.py:37(__init__)
        2    0.000    0.000    0.001    0.000 protocol.py:220(_send)
        2    0.000    0.000    0.001    0.000 protocol.py:227(_send_request)
        2    0.000    0.000    0.000    0.000 protocol.py:241(_box)
        2    0.000    0.000    0.001    0.000 protocol.py:438(_async_request)
        2    0.000    0.000    0.000    0.000 stream.py:173(write)

       11    0.000    0.000    0.000    0.000 topology.py:14(_validate)
        3    0.000    0.000    0.000    0.000 topology.py:33(__call__)
        3   48.626   16.209   48.626   16.209 topology.py:40(__call__)
        2    0.001    0.000    0.001    0.000 topology.py:57(__call__)
        5    0.000    0.000    0.000    0.000 utf_8.py:15(decode)
        5    0.000    0.000    0.000    0.000 {__import__}
        5    0.000    0.000    0.000    0.000 {_codecs.utf_8_decode}
        3    0.000    0.000    0.000    0.000 {_ctypes.byref}
      6/2    0.000    0.000    0.000    0.000 {all}
        6    0.000    0.000    0.000    0.000 {getattr}

        5    0.000    0.000    0.000    0.000 {globals}
       10    0.000    0.000    0.000    0.000 {hasattr}
        5    0.000    0.000    0.000    0.000 {isinstance}
       31    0.000    0.000    0.000    0.000 {len}
        5    0.000    0.000    0.000    0.000 {locals}
        1    0.000    0.000    0.000    0.000 {math.sqrt}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of 'thread.lock' objects}
       24    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       28    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}

        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'pack' of 'Struct' objects}
        2    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
        2    0.000    0.000    0.000    0.000 {method 'send' of '_socket.socket' objects}
        2    0.000    0.000    0.000    0.000 {next}

I'm still trying to figure out exactly what it means, but I think it's telling me that the intersection function is taking ~16 seconds each time it's called. So, per @gene 's suggestion, I'm working on using a spatial index. I just have to figure out how to get libspatialindex going so that I can use Rtrees.

Answer

You need to use a spatial index. Without an spatial index, you must iterate through all the geometries. With a bounding spatial index, you iterate only through the geometries which have a chance to intersect the other geometries.

Popular bounding spatial indexes in Python:

Examples:

In your script:

Why
- from fiona import collectionif you import fiona and use only fiona?
- import numpy as np if you are not using Numpy but math?
- import shapely if you specify from shapely.geometry import *?

If sa is a list, then use sa = [shape(i['geometry']) for i in for i in study_area]. If not, you need a condition; otherwise sa will be the last feature of study_area.

Why pol = grassland.next() if you read the shapefile with MultiPolygon([shape(pol['geometry']) for pol in grassland])?

Some suggestions:

You use the grassland and trap variables once, so you don't need them

gl = MultiPolygon([shape(pol['geometry']) for pol in fiona.open('land_cover_type_of_interest.shp')])

and

for point in fiona.open('some_points.shp'):

Blog

Saturday, 27 January 2018

Maximizing Code Performance for Shapely

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?