Friday, 11 March 2016

python - Aggregating Polygons To Meet Privacy Requirements


I have a point feature class representing office locations of all employers in a certain industry. The feature class has an attribute for storing the number of employees working at each office. Someone has requested to use this data, spatially-joined to the smallest geographic unit possible — Census Blocks, in this case. However, a privacy agreement prevents the data's release as-is. Instead, it must be suppressed to meet two criteria:



  1. Any polygon must contain at least 3 employers (points);

  2. No more than 80% of the total employment within a polygon may be by a single employer.


I have successfully written a script that spatially joins the points to Census Blocks, keeping the sum and maximum employment in each. Each one that does not meet the suppression criteria is flagged. (Polygons containing no points are not flagged, since there is no data to suppress.) I then check each Block Group to see whether any flagged Blocks are contained within it. Block Groups containing only un-flagged Blocks are then replaced with the Blocks. The resulting features class is then checked against the suppression criteria, to check whether the Block Groups have adequately suppressed the data.



The same process is repeated for Tracts, leaving me with a dataset consisting of Tracts (some flagged and some not), Block Groups and Blocks (all un-flagged). The next progression in the geographic hierarchy, however, is the county, which is of no use to the person requesting this data.


My question, then, is this: Are there any commonly accepted methods of aggregating polygons into as many groups as possible, so that all meet some minimum criteria?


Here are some rules that I would like to apply to the aggregation:



  • Whenever possible, flagged Tracts should only be aggregated with other flagged Tracts;

  • For the flagged Tracts that are not contiguous with any others (or isolated groupings that still do not meet the criteria), they can be joined with Tracts that already meet to criteria, although there may be Tracts with no employers in between them that will also need to be included.

  • I would like to keep county boundaries intact unless absolutely impossible (and I anticipate doing this by separating the inputs features into their respective counties before processing them).

  • The solution must be in Python, with the use of ArcGIS tools or open-source Python libraries.


Ideally, someone can point me to an existing means of implementing this aggregation. If not, I am happy to code the algorithm myself, although a list of specific steps/tools would be much appreciated. The problem strikes me as a special case of redistricting (with dis-contiguous polygons), and to this end I have looked into using PySAL's regionalization algorithms, although it is not clear to me how to check the maximum employer's percentage of total employees using these.




Answer



For anyone who's curious, I came up with a solution on my own, using PySAL's region.Maxp algorithm. Essentially, Max-p allows me to generate a set of regions that meets my first criterion (minimum number of employers per region), and I put that inside a while loop, which will reject any of Max-p's solutions that don't also satisfy the second criterion (percentage of employment contributed by the largest employer in a region). I've implemented it as an ArcGIS tool.


I decided to scrap the work I'd done previously to flag blocks/blockgroups/tracts and instead run Max-p on blocks (although I've been doing all my testing on tracts, as a modest increase in the number of input polygons has a dramatic effect on processing time). The relevant part of my code follows. The "shapefile" required as input for the generate_regions() function (passed as a string containing the full path of a shapefile) is one that has had the employers point features already spatially joined to it, with the number of employers, maximum employees from a single employer, and total employees stored as an attribute for each input feature.


import arcpy, math, pysal, random
import numpy as np

# Suppression criteria:
MIN_EMP_CT = 3 # Minimum number of employers per polygon feature
MAX_EMP_FRAC = 0.8 # Maximum ratio of employees working for a single employer per polygon feature


def generate_regions(shapefile, min_emp_ct=MIN_EMP_CT, max_emp_frac=MAX_EMP_FRAC):
'''Use pysal's region.Maxp method to generate regions that meet suppression criteria.'''
w = pysal.rook_from_shapefile(shapefile, idVariable='GEOID10')
dbf = pysal.open(shapefile[:-4] + '.dbf')
ids = np.array((dbf.by_col['GEOID10']))
vars = np.array((dbf.by_col[employer_count_fieldname],dbf.by_col[max_employees_fieldname],dbf.by_col[total_employees_fieldname]))
employers = vars[0]
vars = vars.transpose()
vars_dict = {}
for i in range(len(ids)):

vars_dict[ids[i]] = [int(vars[i][0]),float(vars[i][1]),float(vars[i][2])]
random.seed(100) # Using non-random seeds ensures repeatability of results
np.random.seed(100) # Using non-random seeds ensures repeatability of results
bump_iter = int(arcpy.GetParameterAsText(3)) # Number of failed iterations after which to increment the minimum number of employers per region (otherwise we could be stuck in the loop literally forever).
iteration = 0
tests_failed = 1
while tests_failed:
floor = int(min_emp_ct + math.floor(iteration / bump_iter))
solution = pysal.region.Maxp(w,vars,floor,employers)
regions_failed = 0

for region in solution.regions:
SUM_emp10sum = 0
MAX_emp10max = 0
for geo in region:
emp10max = vars_dict[geo][1]
emp10sum = vars_dict[geo][2]
SUM_emp10sum += emp10sum
MAX_emp10max = max(MAX_emp10max, emp10max)
if SUM_emp10sum > 0:
ratio = MAX_emp10max / SUM_emp10sum

else:
ratio = 1
if ratio >= max_emp_frac:
regions_failed += 1
iteration += 1
if regions_failed == 0:
arcpy.AddMessage('Iteration ' + str(iteration) + ' (MIN_EMP_CT = ' + str(floor) +') - PASSED!')
tests_failed = 0
else:
arcpy.AddMessage('Iteration ' + str(iteration) + ' (MIN_EMP_CT = ' + str(floor) +') - failed...')

return solution

solution = generate_regions(spatially_joined_shapefile)

regions = solution.regions

### Write input-to-region conversion table to a CSV file.
csv = open(conversion_table,'w')
csv.write('"GEOID10","REGION_ID"\n')
for i in range(len(regions)):

for geo in regions[i]:
csv.write('"' + geo + '","' + str(i+1) + '"\n')
csv.close()

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...