I'm working with data from the American Community Survey, provided by the U.S. Census. Specifically, I've downloaded gdb files from here.
I have the gdb file ACS_2016_5YR_TRACT.gdb
on my server and I'm attempting to create a psql table with the following:
ogr2ogr -f "PostgreSQL" PG:"host=host port=5432 dbname=db user=user password=password" ACS_2016_5YR_TRACT.gdb -overwrite -progress --config PG_USE_COPY YES
The output I get when running the above is a CREATE TABLE
command with over 1600 columns which it tries to execute but fails with:
ERROR: tables can have at most 1600 columns
Has anybody had success loading ACS5 data onto PostGIS? How could I get around this?
I have to imagine this is a common use case for people working with census data... I'm wondering if there is a way to pass flags to ogr2ogr that would allow it to partition the data? Maybe there's a way to increase the column limit in postgresql? I've tried changing the db settings and recompiling, as suggested in other answers, but haven't had much luck. I'm not sure if that's the best way to go about it though.
Answer
I figured out a way to get the gdb files into a database. Most of the process can be automated, but I'll have to manually prepare some files.
First, run the ogr2ogr
command that turns the gdb file into database tables, but this time skip failures with the -skipfailures
flag. Also, be sure to feed it a schema otherwise you'll be overwriting tables if you plan on storing data for multiple years and/or census geographies:
ogr2ogr -f "PostgreSQL" PG:"host=host port=5432 dbname=db user=user password=password" ACS_2016_5YR_TRACT.gdb -overwrite -progress --config PG_USE_COPY YES -skipfailures -lco SCHEMA=acs_2016_5yr_tract
After the command above is run, the gdb data should now mostly be in the database. Check the tables that were created by listing them. Because the tables are named sequentially, you can tell which tables failed to create:
geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(21 rows)
Next, following the documentation, use ogr2ogr
to turn the gdb file into an SQL dump. Remember to give it a schema also. Running the following command will create a SQL file called ACS_2016_5YR_TRACT.sql
which contains all the statements to populate the database:
ogr2ogr --config PG_USE_COPY YES -f PGDump ACS_2016_5YR_TRACT.sql ACS_2016_5YR_TRACT.gdb -lco SRID=4269 -lco SCHEMA=acs_2016_5yr_tract
After the SQL dump completes, I found the following grep
command useful for listing out all the tables in the resulting sql file:
$ grep -n 'DROP TABLE IF EXISTS' ACS_2016_5YR_TRACT.sql
2:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x00_counts" CASCADE;
74015:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x01_age_and_sex" CASCADE;
148742:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x02_race" CASCADE;
223183:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x03_hispanic_or_latino_origin" CASCADE;
297302:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x04_ancestry" CASCADE;
371975:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x05_foreign_born_citizenship" CASCADE;
447676:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x06_place_of_birth" CASCADE;
522905:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x07_migration" CASCADE;
598706:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x08_commuting" CASCADE;
675255:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x09_children_household_relationship" CASCADE;
749576:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x10_grandparents_grandchildren" CASCADE;
823957:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x11_household_family_subfamilies" CASCADE;
898710:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x12_marital_status_and_history" CASCADE;
973477:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x13_fertility" CASCADE;
1047888:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x14_school_enrollment" CASCADE;
1122675:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x15_educational_attainment" CASCADE;
1197398:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x16_language_spoken_at_home" CASCADE;
1272277:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x17_poverty" CASCADE;
1350226:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x18_disability" CASCADE;
1425127:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x19_income" CASCADE;
1502128:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x20_earnings" CASCADE;
1578319:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x21_veteran_status" CASCADE;
1652892:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x22_food_stamps" CASCADE;
1727143:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x23_employment_status" CASCADE;
1802776:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x24_industry_occupation" CASCADE;
1878891:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x25_housing_characteristics" CASCADE;
1957314:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x26_group_quarters" CASCADE;
2031325:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x27_health_insurance" CASCADE;
2106968:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x99_imputation" CASCADE;
2181725:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."tract_metadata_2016" CASCADE;
2217196:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."acs_2016_5yr_tract" CASCADE;
Comparing the database table listing and the output from the SQL file grep
, we see the first table that failed is x05_foreign_born_citizenship
.
Because the grep
output above also lists line numbers, we see that lines 371975 through 447675 of ACS_2016_5YR_TRACT.sql
contain all the commands to create x05_foreign_born_citizenship
. (I confirmed this by opening the file, and you may want to confirm yourself a few times. The file is huge though, so it'll be slow going.)
Using the following sed
command, I can write the lines to a separate file, x05.txt
, that's smaller and easier to work with:
$ sed '371975,447675!d' ACS_2016_5YR_TRACT.sql > x05.txt
From here on out it's a matter of writing a script that manipulates the x05.txt
file. What you'll need to do is break up the one large 1600+ column table into multiple tables. I would offer my own script to the community, but can't because my company owns it. I used Python and pandas to handle it. The files are kind of big, so the server I was on kept on having memory errors. I had to do the work locally on 16GB of RAM. Keep that in mind if you run into memory issues... I could have saved processing power if I didn't use pandas, but it simplified a lot of things. YMMV...
In general terms, here is how I wrote the script:
You'll see the first part of x05.txt
is SQL DROP
, CREATE
, and ALTER
statements. The ALTER
statements are where the columns are added. My script counted the ALTER
statements and applied the first 1000 to x_05_00
table, the second 1000 to x_05_01
, etc. So update all the SQL statements and run them to create your tables and define the fields in the database.
One gotcha: Make sure all your subtables have the geoid
field.
Next, in x05.txt
, you'll come to an SQL COPY
statement which lists the column names. You can parse that string into a columns
list variable. Back to the file, the next set of lines immediately after the COPY
statement is the data itself in tab-delimited form. I wrote all the lines of the data to a csv using the columns
list variable as the csv header. (Originally I tried to read all the data into a dataframe, but it's too much to handle in memory - better to write it to disk.)
I then read the lines from the csv into a pandas dataframe in chunks
. I was able to slice each chunk by the first 1000 variables, and using .to_sql
was able to append the data to the x05_00
table. Similar for the next 1000 variables - those populate the x05_01
table, etc. Iterate through all the chunks, and you're done.
If your script is correct, then you'll have created something like the following. Also, if scripted correctly, the newly created tables will be indexed with default values, etc.
list tables:
geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x05_00_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_01_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(23 rows)
table description:
geoservice=> \d acs_2016_5yr_tract.x05_00_foreign_born_citizenship
Table "acs_2016_5yr_tract.x05_00_foreign_born_citizenship"
Column | Type | Collation | Nullable | Default
-------------+-----------------------+-----------+----------+--------------------------------------------------------------------------------------
objectid | integer | | not null | nextval('acs_2016_5yr_tract.x05_00_foreign_born_citizenship_objectid_seq'::regclass)
geoid | character varying(19) | | |
b05001e1 | double precision | | |
...
...
...
b05006pre12 | double precision | | |
Indexes:
"x05_00_foreign_born_citizenship_pk" PRIMARY KEY, btree (objectid)
geoservice-> \d acs_2016_5yr_tract.x05_01_foreign_born_citizenship
Table "acs_2016_5yr_tract.x05_01_foreign_born_citizenship"
Column | Type | Collation | Nullable | Default
-------------+-----------------------+-----------+----------+--------------------------------------------------------------------------------------
objectid | integer | | not null | nextval('acs_2016_5yr_tract.x05_01_foreign_born_citizenship_objectid_seq'::regclass)
geoid | character varying(19) | | |
b05006prm12 | double precision | | |
...
...
...
b05014m19 | double precision | | |
Indexes:
"x05_01_foreign_born_citizenship_pk" PRIMARY KEY, btree (objectid)
table counts:
geoservice=> select count(*) from acs_2016_5yr_tract.x05_00_foreign_born_citizenship;
count
-------
74001
(1 row)
geoservice=> select count(*) from acs_2016_5yr_tract.x05_01_foreign_born_citizenship;
count
-------
74001
(1 row)
UPDATE:
All done - I migrated all the 2016 ACS5 gdb files to a PostGIS database. I had to split the tables into 500 columns each to avoid memory issues during processing.
geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x05_00_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_01_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_02_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_03_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x07_00_migration | table | geo_root
acs_2016_5yr_tract | x07_01_migration | table | geo_root
acs_2016_5yr_tract | x07_02_migration | table | geo_root
acs_2016_5yr_tract | x07_03_migration | table | geo_root
acs_2016_5yr_tract | x08_00_commuting | table | geo_root
acs_2016_5yr_tract | x08_01_commuting | table | geo_root
acs_2016_5yr_tract | x08_02_commuting | table | geo_root
acs_2016_5yr_tract | x08_03_commuting | table | geo_root
acs_2016_5yr_tract | x08_04_commuting | table | geo_root
acs_2016_5yr_tract | x08_05_commuting | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x17_00_poverty | table | geo_root
acs_2016_5yr_tract | x17_01_poverty | table | geo_root
acs_2016_5yr_tract | x17_02_poverty | table | geo_root
acs_2016_5yr_tract | x17_03_poverty | table | geo_root
acs_2016_5yr_tract | x17_04_poverty | table | geo_root
acs_2016_5yr_tract | x17_05_poverty | table | geo_root
acs_2016_5yr_tract | x17_06_poverty | table | geo_root
acs_2016_5yr_tract | x17_07_poverty | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x19_00_income | table | geo_root
acs_2016_5yr_tract | x19_01_income | table | geo_root
acs_2016_5yr_tract | x19_02_income | table | geo_root
acs_2016_5yr_tract | x19_03_income | table | geo_root
acs_2016_5yr_tract | x19_04_income | table | geo_root
acs_2016_5yr_tract | x19_05_income | table | geo_root
acs_2016_5yr_tract | x20_00_earnings | table | geo_root
acs_2016_5yr_tract | x20_01_earnings | table | geo_root
acs_2016_5yr_tract | x20_02_earnings | table | geo_root
acs_2016_5yr_tract | x20_03_earnings | table | geo_root
acs_2016_5yr_tract | x20_04_earnings | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x23_00_employment_status | table | geo_root
acs_2016_5yr_tract | x23_01_employment_status | table | geo_root
acs_2016_5yr_tract | x23_02_employment_status | table | geo_root
acs_2016_5yr_tract | x23_03_employment_status | table | geo_root
acs_2016_5yr_tract | x24_00_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_01_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_02_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_03_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_04_industry_occupation | table | geo_root
acs_2016_5yr_tract | x25_00_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_01_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_02_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_03_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_04_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_05_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_06_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_07_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_08_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x27_00_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_01_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_02_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_03_health_insurance | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(76 rows)
No comments:
Post a Comment