Sunday 23 September 2018

distance - How to group 10k points into closest pairs?


I was first faced with a problem of grouping 1000 pairs into unique pairs to minimise the total distance - since this was a small number I solved it using linear optimisation:



Part 1 - Basic/Naive approach:


A bit more detail on it here but the gist is:


Create an objective function which wants to minimise all the entries in the lower triangle of a distance matrix (multiplied by the respective dummy). For example v1_10 means that point 1 and point 10 are connected. Such that each store is connected to only one (other store).


enter image description here


The result of the optimisation takes us from the first picture to the second. The structure of the code is thus:


shops = [[id, latitude, longitude],...]

The helper functions to create the optimisation:


def store_connect(shops):
"""

Cross stores and return connection e.g. "v1-2" by ID
"""
for i in range(len(shops)-1):
for j in range(i+1,len(shops)):
yield 'v'+str(shops[i][0])+'-'+str(shops[j][0])


def distances(shops):
"""
Return distance in miles for crosses

"""
for i in range(len(shops)-1):
for j in range(i+1,len(shops)):
yield haversine([shops[i][1],shops[i][2]],
[shops[j][1],shops[j][2]])


def all_constraints(shops):
"""
Return constraint row

"""
for a in shops:
cons = []
for b in shops:
if a[0] cons.append("v"+str(a[0])+"-"+str(b[0]))
elif a[0]>b[0]:
cons.append("v"+str(b[0])+"-"+str(a[0]))
yield cons


And then the optimisation itself:


prob = LpProblem("Minimise distance",LpMinimize)

# Variables
x = []
xnames = []
for one in store_connect(shops):
xnames.append(one)
x.append(LpVariable(one,0,1,LpInteger))
print("Created %d variables" % len(x))


# Objective
prob += pulp.lpSum([i[1]*x[i[0]] for i in enumerate(distances(shops))])
print("Created objective")

# Constraints
counter = 0
for constraint_rw in all_constraints(shops):
prob += pulp.lpSum(x[xnames.index(one_cn)] for one_cn in constraint_rw) == 1
counter += 1

if counter % 10 == 0:
print("Added %d contraints out of %d" % (counter,len(shops)))

Which is straight-forward to solve directly or upload the .lp file to CPLEX:


#Using PULP - if not too complicated ... otherwise CPLEX:
prob.solve()
# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])

# Optimum variable values

sol = []
for v in prob.variables():
if v.varValue:
sol.append(v.name)
print("Total minimised distance: ", value(prob.objective))

shp_names = [row[0] for row in shops]
edges = []
for x in sol:
v_list = (x)[1:].split("_") #remove the first letter 'v' and split

conn = [it for sub in [shops[shp_names.index(float(v_list[0]))][1:3],
shops[shp_names.index(float(v_list[1]))][1:3]] for it in sub]
edges.append(conn)

# Print diagram
plot_points(shops,edges)

Part 2 - Better approach (for lots of points):


If I expand the problem to 10,000 points I get roughly 50,000,000 variables. I was thinking of the following ways of simplifying the problem:


1) Use clustering to group the 10,000 points into something like 100 groups and then solve 100 times within those groups (however I'm not sure how to check whether the solution would be optimal).



2) Project the points; put them into a quad-tree and then instead of allowing connections between one point and any other point -> only allow connections to the closest 100 points (e.g.): however, I am not sure how to properly set this (the cut-off). I thought perhaps if a point is chosen that is the cut-off point then the algorithm will restart with a larger cut-off.


3) Run something like a minimum spanning tree or concorde tsp algorithm and then back-track the solution from that?




No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...