Friday 22 July 2016

Splitting shapefile into many shapefiles with open source?


I want to create a shapefile for each of the 72,000 census tracts in the United States that includes the census blocks in that tract using open source software.


I will probably start with a state shapefile that includes all of the blocks for that state and merge it with a GDB file. This will give me a list of the blocks that are in each census tract.


But then I'm not sure how to split the state shapefile into smaller ones.


How can I do this?


Is there a way to automate this using open source software?


If so, what tools should I use?



I'm learning QGIS.



Answer



Basically, you can do the extract using ogr2ogr as long as you supply the census tract ID, so it's really a matter of generating 72,000 ogr2ogr calls.


ogr2ogr -where "tract = ''" /dest_folder /source_folder block_shapefile -nln block_shapefile_

Notes:



  • You don't have to specify the source format; ogr2ogr will figure it out. You specified shapefile, so that's what I'm assuming.

  • If you are not changing the format, you also don't have to specify the destination format. Otherwise, to get a shapefile, add `-f "ESRI Shapefile"`.

  • I'm using the -where switch to subset the data. Remember that an attribute query is much faster than a spatial query. Your question was a little vague, so I don't know whether you intend to join the blocks to the tracts by attribute or spatially, but I highly recommend the former.


  • Note that I am assuming your blocks shapefile has a tract ID column in it. Most Census data sets have a hierarchy of ID columns, i.e. the blocks shapefile will probably have state, county, and tract columns as well as a block ID column, but may not have a column concatenating them all together. They must be concatenated, because counties are only unique within states, tracts are only unique within counties, and blocks are only unique within tracts. So you will have to either (a) create a new field concatenating state, county, and tract as your ID field, or (b) add all three criteria to the where clause (see the sketch after these notes).
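For example, option (b) might look like the following. This is only a sketch: the TIGER-style field names STATEFP, COUNTYFP, and TRACTCE and the example values are assumptions, and the fields may be named differently in your data.

ogr2ogr -where "STATEFP = '06' AND COUNTYFP = '001' AND TRACTCE = '400100'" /dest_folder /source_folder block_shapefile -nln block_shapefile_06001400100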


So how do you build 72,000 ogr2ogr calls programmatically? You can use any tool you want, but here's an example with R:


library(foreign)

# Read the attribute table (.dbf) of the blocks shapefile
dfBlocks = read.dbf("/source_folder/shapefile_name.dbf", as.is=TRUE)

# One ogr2ogr call per unique tract ID
strTract = unique(dfBlocks$tract_id)
for (i in 1:length(strTract)) {
  strOGR = paste(
    "ogr2ogr -where \"tract_id = '", strTract[i],
    "'\" /dest_folder /source_folder layer_name -nln base_name_",
    strTract[i], sep=""
  )
  system(strOGR)
}

You could also build all of the command strings in one step and then run the ogr2ogr calls in a separate loop, run them at a later time, or write them to a bash script that you run from the command line.
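For instance, here is a minimal sketch of the bash-script variant, reusing the placeholder names from the loop above (paste0 is vectorized, so strOGR ends up holding one command per tract):

# Build all the command strings in one step
strOGR = paste0(
  "ogr2ogr -where \"tract_id = '", strTract,
  "'\" /dest_folder /source_folder layer_name -nln base_name_",
  strTract
)

# Write them out as a bash script to run later from the command line
writeLines(c("#!/bin/bash", strOGR), "export_tracts.sh")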


The major disadvantage is that it will scan the data source once for each ogr2ogr call, so importing the source data once, iterating over the tracts, and writing a shapefile for each would probably be more efficient. But I would recommend trying that in something other than R, which is somewhat slow at reading large spatial datasets (I ran an import of just the counties of the US, ~3,000 features, and cancelled it after 15 minutes).
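If you do want the read-once approach in R anyway, a hedged sketch with the sf package might look like this (sf is my suggestion rather than part of the original workflow, and the field and path names are the same placeholders as above):

library(sf)

# Read the blocks once, then write one shapefile per tract
blocks <- st_read("/source_folder/shapefile_name.shp")
for (tract in unique(blocks$tract_id)) {
  st_write(blocks[blocks$tract_id == tract, ],
           file.path("/dest_folder", paste0("base_name_", tract, ".shp")),
           quiet = TRUE)
}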

