Tuesday, 21 April 2015

r - How to join a table to a shapefile with non-matching IDs and names (similar strings)?


I'm having an annoying problem that I'm trying to find an automated solution for. The shorthand version is that I have a shapefile and a table of created data for regions within countries. The created data table does NOT have any sort of standardized GIDs/admin codes to match to shapefiles, and the region names are not exact matches either. Let's take a closer look; here's my dummy data frame + shapefile.


library(rgdal)

#load in shapefile
arm <- readOGR("D:/Country-Shapefiles/ARM_adm_shp", layer = "ARM_adm1")

#create dummy data frame
id <- c(100:110)
name <- c("Aragatsotn", "Ararat", "Armavir", "Gaghark'unik'", "Kotayk", "Lorri",

"Shirak", "Syunik'", "Tavush", "Vayots' Dzor", "Yerevan City")
value <- runif(11, 0.0, 1.0)
df <- data.frame(id, name, value)

So what I have is a table with seemingly random IDs, region names, and a value to be plotted with a choropleth map. Looks like this:


> df
id name value
1 100 Aragatsotn 0.6923852
2 101 Ararat 0.5762024
3 102 Armavir 0.4688358

4 103 Gaghark'unik' 0.4702253
5 104 Kotayk 0.9347992
6 105 Lorri 0.1937813
7 106 Shirak 0.5162604
8 107 Syunik' 0.4332389
9 108 Tavush 0.9889513
10 109 Vayots' Dzor 0.2182024
11 110 Yerevan City 0.5791886

Looking at the shapefile attributes of interest, we've got this:



> arm@data[c("ID_1", "NAME_1")]

ID_1 NAME_1
0 1 Aragatsotn
1 2 Ararat
2 3 Armavir
3 4 Erevan
4 5 Gegharkunik
5 6 Kotayk
6 7 Lori

7 8 Shirak
8 9 Syunik
9 10 Tavush
10 11 Vayots Dzor

Ideally, df would include some sort of matching admin IDs to join to the shapefile. Whoever created the data that I'm using did not follow these conventions, unfortunately. Alternatively, it'd be great to match the region names themselves...but as you can see, there are slight variations in each name.


Matching by hand is always a backup solution, but who wants to take the time to do that? ;) But really, even barring laziness, the project that I'm working on will be mapping dozens and dozens of different countries, so I'm looking for an automated solution that can do everything without having to do anything by hand. Is this possible? Can I somehow match these -almost- region names to the shapefiles?


Sidenote: I'm looking into grepl for partial string matches per this post, but I'm not sure if this is a potential solution because I'll need to draw from the column names rather than inputting each region name by hand.


EDIT: When I match the IDs by hand, what I've done is create a new column in my data frame and adding the exact matching terms from the shapefile. Unfortunately, because of the peculiarities of the data, the order of the names don't match up either, so this still requires some manual input. I'm hoping for some sort of completely automated solution (if it's even possible).



Answer




I would go for stringdist package which has implemented many algorithms to calculate the partial similarity (distance) of strings including Jaro-winkler. Here is a fast solution for you:


  #df to be joined
id <- c(100:111)
name <- c("Aragatsotn", "Ararat", "Armavir", "Gaghark'unik'", "Kotayk", "Lorri",
"Shirak", "Syunik'", "Tavush", "Vayots' Dzor", "Yerevan City","Aragatsotn")
value <- runif(12, 0.0, 1.0)
df <- data.frame(id, name, value)

#create shape data df
shpNames <- c("Aragatsotn",

"Ararat",
"Armavir",
"Erevan",
"Gegharkunik",
"Kotayk",
"Lori",
"Shirak",
"Syunik",
"Tavush",
"VayotsDzor")

arm.data <- data.frame(ID_1=1:11,NAME_1=shpNames)

#simple match (only testing)
match(df$name,arm.data$NAME_1)
#simple merge (testing)
merge(arm.data,df,by.x="NAME_1",by.y="name",all.x=TRUE)

#partial match using stringdist package
library("stringdist")
am<-amatch(arm.data$NAME_1,df$name,maxDist = 3)

b<-data.frame()
for (i in 1:dim(arm.data)[1]) {
b<-rbind(b,data.frame(arm.data[i,],df[am[i],]))
}
b

it outputs:


ID_1      NAME_1  id          name     value
1 1 Aragatsotn 100 Aragatsotn 0.8510984
2 2 Ararat 101 Ararat 0.3004329

3 3 Armavir 102 Armavir 0.9258740
4 4 Erevan NA NA
5 5 Gegharkunik 103 Gaghark'unik' 0.9935353
6 6 Kotayk 104 Kotayk 0.6025050
7 7 Lori 105 Lorri 0.9577662
8 8 Shirak 106 Shirak 0.6346550
9 9 Syunik 107 Syunik' 0.6531175
10 10 Tavush 108 Tavush 0.9726032
11 11 VayotsDzor 109 Vayots' Dzor 0.3457315


You can play with maxDist parameter of amatch method. Although 3 works best with your sample data!


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...