Monday, 7 December 2015

r - How to find the most spread out locations?


Probably a very simple question. I have a list of about a thousand potential geographical locations (lat lon), and of those I need to select the 200 locations that are "most spread out". I think that would be the 200 points with highest total average distance. Think of stores in a city.



Is there a defined method to do it? Maybe in an R-package?


Thanks to everyone making this the great place to learn it is!


/Chris



Answer



Simple question, difficult solution.


The best method I know uses simulated annealing (I have used this to select a few dozen points out of tens of thousands and it scales extremely well to selecting 200 points: the scaling is sublinear), but this requires careful coding and considerable experimentation, as well as a huge amount of computation. You should start by looking at simpler, faster methods first to see whether they will suffice.


One way is first to cluster the store locations. Within each cluster select the store closest to the cluster center.


A really fast clustering method is K-means. Here is an R solution that uses it.


scatter <- function(points, nClusters) {
#

# Find clusters. (Different methods will yield different results.)
#
clusters <- kmeans(points, nClusters)
#
# Select the point nearest the center of each cluster.
#
groups <- clusters$cluster
centers <- clusters$centers
eps <- sqrt(min(clusters$withinss)) / 1000
distance <- function(x,y) sqrt(sum((x-y)^2))

f <- function(k) distance(centers[groups[k],], points[k,])
n <- dim(points)[1]
radii <- apply(matrix(1:n), 1, f) + runif(n, max=eps)
# (Distances are changed randomly to select a unique point in each cluster.)
minima <- tapply(radii, groups, min)
points[radii == minima[groups],]
}

The arguments to scatter are a list of store locations (as an n by 2 matrix) and the number of stores to select (e.g., 200). It returns an array of locations.


As an example of its application, let's generate n = 1000 randomly located stores and see what the solution looks like:



# Create random points for testing.
#
set.seed(17)
n <- 1000
nClusters <- 200
points <- matrix(rnorm(2*n, sd=10), nrow=n, ncol=2)
#
# Do the work.
#
system.time(centers <- scatter(points, nClusters))

#
# Map the stores (open circles) and selected ones (closed circles).
#
plot(centers, cex=1.5, pch=19, col="Gray", xlab="Easting (Km)", ylab="Northing")
points(points, col=hsv((1:nClusters)/(nClusters+1), v=0.8, s=0.8))

This calculation took 0.03 seconds:


Figure


You can see it's not great (but it's not too bad, either). To do much better is going to require either stochastic methods, like simulated annealing, or algorithms that are likely to scale exponentially with the size the problem. (I have implemented such an algorithm: it takes 12 seconds for selecting the 10 most widely spaced points out of 20. Applying it to 200 clusters is out of the question.)


A good alternative to K-means is a hierarchical clustering algorithm; try "Ward's" method first and consider experimenting with other linkages. This will take more computation, but we're still talking about just a few seconds for 1000 stores and 200 clusters.



Other methods exist. For instance, you could cover the region with a regular hexagonal grid and, for cells containing one or more stores, select the store nearest its center. Play a little with the cellsize until approximately 200 stores have been selected. This will produce a very regular spacing of stores, which you might or might not want. (If these are truly store locations, this would probably be a bad solution, because it would have a tendency to choose stores in the least populous areas. In other applications this might be a much better solution.)


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...