Monday 22 February 2016

ogr2ogr - Migrating US Census GDB file to PostGIS -- ERROR: tables can have at most 1600 columns


I'm working with data from the American Community Survey (ACS), published by the U.S. Census Bureau. Specifically, I've downloaded the geodatabase (.gdb) files the Bureau provides for download.


I have the file ACS_2016_5YR_TRACT.gdb on my server and I'm attempting to load it into PostgreSQL with the following:


ogr2ogr -f "PostgreSQL" PG:"host=host port=5432 dbname=db user=user password=password" ACS_2016_5YR_TRACT.gdb -overwrite -progress --config PG_USE_COPY YES


Running the above produces a CREATE TABLE statement with over 1600 columns, which fails to execute with:


ERROR:  tables can have at most 1600 columns

Has anybody had success loading ACS5 data into PostGIS? How can I get around this?


I have to imagine this is a common use case for people working with census data. Is there a flag I can pass to ogr2ogr that would partition the data? Or is there a way to raise the column limit in PostgreSQL? I've tried changing the database settings and recompiling, as suggested in other answers, but without much luck, and I'm not sure that's the right approach anyway.



Answer



I figured out a way to get the gdb files into a database. Most of the process can be automated, but some files have to be prepared manually.


First, run the ogr2ogr command that loads the gdb file into database tables, but this time skip failures with the -skipfailures flag. Also be sure to specify a schema; otherwise you'll overwrite tables if you plan on storing data for multiple years and/or census geographies:


ogr2ogr -f "PostgreSQL" PG:"host=host port=5432 dbname=db user=user password=password" ACS_2016_5YR_TRACT.gdb -overwrite -progress --config PG_USE_COPY YES -skipfailures -lco SCHEMA=acs_2016_5yr_tract


After the command above runs, most of the gdb data should be in the database. List the tables that were created. Because the tables are named sequentially, you can tell which ones failed to create:


geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(21 rows)



Next, following the documentation, use ogr2ogr to turn the gdb file into a SQL dump. Remember to specify the schema here as well. The following command creates a file called ACS_2016_5YR_TRACT.sql containing all the statements needed to populate the database:



ogr2ogr --config PG_USE_COPY YES -f PGDump ACS_2016_5YR_TRACT.sql ACS_2016_5YR_TRACT.gdb -lco SRID=4269 -lco SCHEMA=acs_2016_5yr_tract

After the dump completes, I found the following grep command useful for listing all the tables in the resulting SQL file:


$ grep -n 'DROP TABLE IF EXISTS' ACS_2016_5YR_TRACT.sql
2:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x00_counts" CASCADE;
74015:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x01_age_and_sex" CASCADE;
148742:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x02_race" CASCADE;
223183:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x03_hispanic_or_latino_origin" CASCADE;
297302:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x04_ancestry" CASCADE;
371975:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x05_foreign_born_citizenship" CASCADE;
447676:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x06_place_of_birth" CASCADE;
522905:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x07_migration" CASCADE;
598706:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x08_commuting" CASCADE;
675255:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x09_children_household_relationship" CASCADE;
749576:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x10_grandparents_grandchildren" CASCADE;
823957:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x11_household_family_subfamilies" CASCADE;
898710:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x12_marital_status_and_history" CASCADE;
973477:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x13_fertility" CASCADE;
1047888:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x14_school_enrollment" CASCADE;
1122675:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x15_educational_attainment" CASCADE;
1197398:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x16_language_spoken_at_home" CASCADE;
1272277:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x17_poverty" CASCADE;
1350226:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x18_disability" CASCADE;
1425127:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x19_income" CASCADE;
1502128:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x20_earnings" CASCADE;
1578319:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x21_veteran_status" CASCADE;
1652892:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x22_food_stamps" CASCADE;
1727143:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x23_employment_status" CASCADE;
1802776:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x24_industry_occupation" CASCADE;
1878891:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x25_housing_characteristics" CASCADE;
1957314:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x26_group_quarters" CASCADE;
2031325:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x27_health_insurance" CASCADE;
2106968:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."x99_imputation" CASCADE;
2181725:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."tract_metadata_2016" CASCADE;
2217196:DROP TABLE IF EXISTS "acs_2016_5yr_tract"."acs_2016_5yr_tract" CASCADE;



Comparing the database table listing with the grep output from the SQL file, we see that the first table that failed to load is x05_foreign_born_citizenship.
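If you're loading many of these geodatabases, eyeballing the two listings gets tedious. A minimal Python sketch of the comparison (the sets below are truncated to the first few names from the listings above; fill them in from your own \dt and grep output):

```python
# Find tables present in the SQL dump but missing from the database.
# "loaded" comes from the \dt listing, "dumped" from the grep output;
# both sets are truncated here for brevity.
loaded = {
    "x00_counts", "x01_age_and_sex", "x02_race",
    "x03_hispanic_or_latino_origin", "x04_ancestry", "x06_place_of_birth",
}
dumped = {
    "x00_counts", "x01_age_and_sex", "x02_race",
    "x03_hispanic_or_latino_origin", "x04_ancestry",
    "x05_foreign_born_citizenship", "x06_place_of_birth",
}
missing = sorted(dumped - loaded)
print(missing)  # ['x05_foreign_born_citizenship']
```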


Because the grep output also includes line numbers, we know that lines 371975 through 447675 of ACS_2016_5YR_TRACT.sql contain all the statements that create and populate x05_foreign_born_citizenship. (I confirmed this by opening the file; you may want to spot-check it yourself. The file is huge, though, so it will be slow going.)


Using the following sed command, write those lines to a separate file, x05.txt, that's smaller and easier to work with:



$ sed '371975,447675!d' ACS_2016_5YR_TRACT.sql > x05.txt



From here on out, it's a matter of writing a script that manipulates x05.txt: the one large 1600+ column table has to be broken up into multiple smaller tables. I would share my own script with the community, but I can't because my company owns it. I used Python and pandas to handle it. The files are big enough that the server I was on kept running out of memory, so I had to do the work locally on a machine with 16 GB of RAM; keep that in mind if you hit memory issues. I could have saved resources by skipping pandas, but it simplified a lot of things. YMMV.




In general terms, here is how I wrote the script:


You'll see that the first part of x05.txt is SQL DROP, CREATE, and ALTER statements; the ALTER statements are where the columns get added. My script counted the ALTER statements and applied the first 1000 to an x05_00 table, the second 1000 to x05_01, and so on. Update all of the SQL statements accordingly, then run them to create the tables and define their columns in the database.


One gotcha: make sure all of your subtables keep the geoid field, so the split tables can be joined back together later.
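In rough terms, the DDL rewriting step might look like the sketch below. This is an illustration rather than my original script, and it assumes each column is added by its own ALTER TABLE ... ADD COLUMN line, which is how the statements appear in the dump:

```python
COLS_PER_TABLE = 1000  # I later dropped this to 500

def split_alter_statements(lines, base="x05_foreign_born_citizenship"):
    # Retarget each ALTER TABLE ... ADD COLUMN statement at a numbered
    # subtable, COLS_PER_TABLE columns apiece: x05_00_..., x05_01_..., etc.
    # Assumes one ADD COLUMN per line, as in the dump described above.
    out, ncols = [], 0
    for line in lines:
        if "ADD COLUMN" in line and base in line:
            part = ncols // COLS_PER_TABLE
            ncols += 1
            # x05_foreign_born_citizenship -> x05_00_foreign_born_citizenship
            line = line.replace(base, f"{base[:3]}_{part:02d}{base[3:]}")
        out.append(line)
    return out
```

You'd still need a matching CREATE TABLE (including the geoid column) for each numbered subtable before running the rewritten ALTER statements.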


Next in x05.txt you'll come to a SQL COPY statement, which lists the column names; you can parse that string into a list of columns. The lines immediately after the COPY statement are the data itself, in tab-delimited form. I wrote all the data lines to a CSV, using the column list as the CSV header. (Originally I tried to read all the data into a dataframe, but it was too much to hold in memory; better to write it to disk.)
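A sketch of that parsing step, assuming a single COPY block in the extracted file (the function name and file layout here are illustrative):

```python
import csv

def copy_to_csv(sql_path, csv_path):
    # Stream the data section of a PGDump COPY block out to a CSV file.
    # The COPY statement names the columns; the tab-delimited rows that
    # follow, up to the terminating "\.", are the data. Assumes a single
    # COPY block, as in the extracted x05.txt.
    with open(sql_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        in_data = False
        for line in src:
            line = line.rstrip("\n")
            if not in_data:
                if line.startswith("COPY "):
                    # e.g. COPY "schema"."table" ("geoid", "b05001e1", ...) FROM STDIN;
                    cols = line[line.index("(") + 1 : line.rindex(")")]
                    writer.writerow(c.strip().strip('"') for c in cols.split(","))
                    in_data = True
            elif line == "\\.":
                break
            else:
                writer.writerow(line.split("\t"))
```

Streaming line by line keeps memory usage flat no matter how large the dump is.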


I then read the CSV into a pandas dataframe in chunks. For each chunk, I sliced out the first 1000 variables and used .to_sql to append them to the x05_00 table; the next 1000 variables went to x05_01, and so on. Iterate through all the chunks, and you're done.



If your script is correct, you'll end up with something like the following, and the newly created tables will have their indexes, sequence defaults, and so on, just like the originals.


list tables:


geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x05_00_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_01_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(23 rows)


table description:


geoservice=> \d acs_2016_5yr_tract.x05_00_foreign_born_citizenship
Table "acs_2016_5yr_tract.x05_00_foreign_born_citizenship"
Column | Type | Collation | Nullable | Default
-------------+-----------------------+-----------+----------+--------------------------------------------------------------------------------------
objectid | integer | | not null | nextval('acs_2016_5yr_tract.x05_00_foreign_born_citizenship_objectid_seq'::regclass)
geoid | character varying(19) | | |
b05001e1 | double precision | | |
...
...
...
b05006pre12 | double precision | | |
Indexes:
"x05_00_foreign_born_citizenship_pk" PRIMARY KEY, btree (objectid)


geoservice=> \d acs_2016_5yr_tract.x05_01_foreign_born_citizenship
Table "acs_2016_5yr_tract.x05_01_foreign_born_citizenship"
Column | Type | Collation | Nullable | Default
-------------+-----------------------+-----------+----------+--------------------------------------------------------------------------------------
objectid | integer | | not null | nextval('acs_2016_5yr_tract.x05_01_foreign_born_citizenship_objectid_seq'::regclass)
geoid | character varying(19) | | |
b05006prm12 | double precision | | |
...
...
...
b05014m19 | double precision | | |
Indexes:
"x05_01_foreign_born_citizenship_pk" PRIMARY KEY, btree (objectid)

table counts:


geoservice=> select count(*) from acs_2016_5yr_tract.x05_00_foreign_born_citizenship;
count
-------
74001
(1 row)

geoservice=> select count(*) from acs_2016_5yr_tract.x05_01_foreign_born_citizenship;
count
-------
74001
(1 row)

UPDATE:


All done: I migrated all of the 2016 ACS5 gdb files into a PostGIS database. I ended up splitting the wide tables into chunks of 500 columns each to avoid memory issues during processing.


geoservice=> \dt acs_2016_5yr_tract.*
List of relations
Schema | Name | Type | Owner
--------------------+-------------------------------------+-------+----------
acs_2016_5yr_tract | acs_2016_5yr_tract | table | geo_root
acs_2016_5yr_tract | tract_metadata_2016 | table | geo_root
acs_2016_5yr_tract | x00_counts | table | geo_root
acs_2016_5yr_tract | x01_age_and_sex | table | geo_root
acs_2016_5yr_tract | x02_race | table | geo_root
acs_2016_5yr_tract | x03_hispanic_or_latino_origin | table | geo_root
acs_2016_5yr_tract | x04_ancestry | table | geo_root
acs_2016_5yr_tract | x05_00_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_01_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_02_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x05_03_foreign_born_citizenship | table | geo_root
acs_2016_5yr_tract | x06_place_of_birth | table | geo_root
acs_2016_5yr_tract | x07_00_migration | table | geo_root
acs_2016_5yr_tract | x07_01_migration | table | geo_root
acs_2016_5yr_tract | x07_02_migration | table | geo_root
acs_2016_5yr_tract | x07_03_migration | table | geo_root
acs_2016_5yr_tract | x08_00_commuting | table | geo_root
acs_2016_5yr_tract | x08_01_commuting | table | geo_root
acs_2016_5yr_tract | x08_02_commuting | table | geo_root
acs_2016_5yr_tract | x08_03_commuting | table | geo_root
acs_2016_5yr_tract | x08_04_commuting | table | geo_root
acs_2016_5yr_tract | x08_05_commuting | table | geo_root
acs_2016_5yr_tract | x09_children_household_relationship | table | geo_root
acs_2016_5yr_tract | x10_grandparents_grandchildren | table | geo_root
acs_2016_5yr_tract | x11_household_family_subfamilies | table | geo_root
acs_2016_5yr_tract | x12_marital_status_and_history | table | geo_root
acs_2016_5yr_tract | x13_fertility | table | geo_root
acs_2016_5yr_tract | x14_school_enrollment | table | geo_root
acs_2016_5yr_tract | x15_educational_attainment | table | geo_root
acs_2016_5yr_tract | x16_language_spoken_at_home | table | geo_root
acs_2016_5yr_tract | x17_00_poverty | table | geo_root
acs_2016_5yr_tract | x17_01_poverty | table | geo_root
acs_2016_5yr_tract | x17_02_poverty | table | geo_root
acs_2016_5yr_tract | x17_03_poverty | table | geo_root
acs_2016_5yr_tract | x17_04_poverty | table | geo_root
acs_2016_5yr_tract | x17_05_poverty | table | geo_root
acs_2016_5yr_tract | x17_06_poverty | table | geo_root
acs_2016_5yr_tract | x17_07_poverty | table | geo_root
acs_2016_5yr_tract | x18_disability | table | geo_root
acs_2016_5yr_tract | x19_00_income | table | geo_root
acs_2016_5yr_tract | x19_01_income | table | geo_root
acs_2016_5yr_tract | x19_02_income | table | geo_root
acs_2016_5yr_tract | x19_03_income | table | geo_root
acs_2016_5yr_tract | x19_04_income | table | geo_root
acs_2016_5yr_tract | x19_05_income | table | geo_root
acs_2016_5yr_tract | x20_00_earnings | table | geo_root
acs_2016_5yr_tract | x20_01_earnings | table | geo_root
acs_2016_5yr_tract | x20_02_earnings | table | geo_root
acs_2016_5yr_tract | x20_03_earnings | table | geo_root
acs_2016_5yr_tract | x20_04_earnings | table | geo_root
acs_2016_5yr_tract | x21_veteran_status | table | geo_root
acs_2016_5yr_tract | x22_food_stamps | table | geo_root
acs_2016_5yr_tract | x23_00_employment_status | table | geo_root
acs_2016_5yr_tract | x23_01_employment_status | table | geo_root
acs_2016_5yr_tract | x23_02_employment_status | table | geo_root
acs_2016_5yr_tract | x23_03_employment_status | table | geo_root
acs_2016_5yr_tract | x24_00_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_01_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_02_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_03_industry_occupation | table | geo_root
acs_2016_5yr_tract | x24_04_industry_occupation | table | geo_root
acs_2016_5yr_tract | x25_00_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_01_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_02_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_03_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_04_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_05_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_06_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_07_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x25_08_housing_characteristics | table | geo_root
acs_2016_5yr_tract | x26_group_quarters | table | geo_root
acs_2016_5yr_tract | x27_00_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_01_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_02_health_insurance | table | geo_root
acs_2016_5yr_tract | x27_03_health_insurance | table | geo_root
acs_2016_5yr_tract | x99_imputation | table | geo_root
(76 rows)
