Wednesday, 15 January 2020

postgis - Looking for fastest solution for Point in Polygon analysis of 200 million points



I have a CSV containing 200 million observations with the following format:


id,x1,y1,x2,y2,day,color
1,"-105.4652334","39.2586939","-105.4321296","39.2236632","Monday","Black"
2,"-105.3224523","39.1323299","-105.4439944","39.3352235","Tuesday","Green"
3,"-104.4233452","39.0234355","-105.4643990","39.1223435","Wednesday","Blue"


For each set of coordinates (x1/y1 and x2/y2), I want to assign the US Census Tract or Census Block that it falls within (I downloaded the Census tract TIGER shapefile here: ftp://ftp2.census.gov/geo/tiger/TIGER2011/TRACT/tl_2011_08_tract.zip). So, I need to do a point-in-polygon operation twice for each observation. It's important that the matches be very accurate.


What is the fastest way to do this, including time to learn the software? I have access to a computer with 48 GB of memory -- in case that might be a relevant constraint.


Several threads recommend using PostGIS or Spatialite (Spatialite looks easier to use -- but is it as efficient as PostGIS?). If those are the best options, is it imperative to populate a spatial index (an R-tree?)? If so, how does one do so (e.g. using the Census Tract Shapefile)? I would be extremely grateful for any recommendations that include example code (or a pointer to example code).


My first attempt (before finding this site) consisted of using ArcGIS to do a spatial join (x1/y1 only) of a subsample of the data (100,000 points) onto US Census Blocks. That took over 5 hours before I killed the process. I'm hoping for a solution that can be implemented on the entire dataset in under 40 hours of computing time.


Apologies for asking a question that has been asked before -- I've read through the answers, and I'm left wondering how to implement the recommendations. I've never used SQL, Python, or C, and have only used ArcGIS once before -- I'm a complete beginner.



Answer



ST_DWithin was faster than ST_Intersects in my test. That is surprising, especially since the prepared-geometry algorithm is supposed to kick in on cases like this. I think there is a chance that this will be quite a lot faster than I showed here.




I did some more tests, and two things made it almost 10 times faster. First, I tried it on a newer computer, but still a quite ordinary laptop, except perhaps for the SATA3 SSD disks.



Then the query below took 18 seconds instead of 62 seconds on the old laptop. Next, I found that I was totally wrong earlier when I wrote that the index on the point table wasn't necessary. With that index in place, ST_Intersects behaved as expected and things became very fast. I increased the number of points in the point table to 1 million and the query:


CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE ST_Intersects(imported_ct.geom, t.geom);

runs in 72 seconds. Since there are 1,249 polygons, that is 1,249,000,000 point-polygon tests done in 72 seconds, or about 17,000,000 tests per second. Put another way, almost 14,000 points are tested against all polygons every second.
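
For reference, the spatial indexes involved can be created like this. This is a minimal sketch, not part of the original timing: it assumes the tract shapefile was loaded with shp2pgsql (whose -I flag already builds the index on the polygon table) and that the geometry columns are named geom as in the query above (adjust to the_geom if your loader used that name).

-- GiST (R-tree style) spatial indexes on both tables, plus fresh planner statistics
CREATE INDEX imported_ct_geom_idx ON imported_ct USING GIST (geom);
CREATE INDEX t_geom_idx ON t USING GIST (geom);
VACUUM ANALYZE imported_ct;
VACUUM ANALYZE t;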


From this test, your 400,000,000 points to test (200 million rows times two points) should take about 8 hours (400,000,000 / 14,000 ≈ 28,600 seconds), without even having to bother with distributing the load over several cores. PostGIS never ceases to impress me :-)




First, to visualize the result you can add the point geometry to the resulting table, open it in QGIS for instance, and style it with unique values on the ct_gid field (the census tract id).
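
A minimal sketch of what that could look like, reusing the table and column names from the queries here (the geometry column names are assumptions and depend on how the tables were loaded):

-- same join as before, but carrying the point geometry along so QGIS can draw it
CREATE TABLE points_ct_vis AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id, t.geom
FROM imported_ct, t
WHERE ST_Intersects(imported_ct.the_geom, t.geom);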


Second, yes, you can also get the points falling outside any polygon by using a right (or left) join, like this:



CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct RIGHT JOIN t ON ST_Intersects(imported_ct.the_geom, t.geom);
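
If only the points that do not fall inside any tract are of interest, a small variation on the same join (a sketch, using the same table and column names) returns just those:

-- points outside every census tract: the polygon side of the outer join is NULL
SELECT t.gid AS point_id
FROM imported_ct RIGHT JOIN t ON ST_Intersects(imported_ct.the_geom, t.geom)
WHERE imported_ct.gid IS NULL;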



I did some tests to verify whether this seems feasible with PostGIS.


First, something I don't understand. You have two points per row. Are both points always in the same polygon? If so, it is enough to do the calculation on one of the points. If they can be in two different polygons, you will need a way to connect one point row to two polygons, as sketched below.
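
A minimal sketch of one way to handle the two-point case (not part of my test; it assumes the CSV has been imported into a table obs with two indexed point geometry columns, geom1 for x1/y1 and geom2 for x2/y2, and that the tract table is imported_ct):

-- look up a census tract for each of the two points on every row;
-- LEFT JOINs keep rows whose points fall outside all tracts
CREATE TABLE obs_ct AS
SELECT o.id,
       ct1.gid AS ct_gid_1,   -- tract containing the first point (x1/y1)
       ct2.gid AS ct_gid_2    -- tract containing the second point (x2/y2)
FROM obs o
LEFT JOIN imported_ct ct1 ON ST_Intersects(ct1.the_geom, o.geom1)
LEFT JOIN imported_ct ct2 ON ST_Intersects(ct2.the_geom, o.geom2);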


From the tests it seems doable, but you might need a creative solution to spread the load over more than one CPU core.


I tested on a 4-year-old laptop with a dual-core Centrino CPU (about 2.2 GHz, I think) and 2 GB RAM. If you have 48 GB RAM, I guess you have a lot more CPU power too.


What I did was create a random point table with 100,000 points, like this:



CREATE TABLE t AS
WITH r AS
  -- bounding box around all census tracts
  (SELECT ST_Extent(the_geom)::geometry ext FROM imported_ct)
SELECT ST_Point(x, y) AS geom
FROM (SELECT GENERATE_SERIES(1, 100000)) s,
     -- random coordinates drawn from within that bounding box
     (SELECT ST_Xmin(ext) + (random() * (ST_Xmax(ext) - ST_Xmin(ext))) x,
             ST_Ymin(ext) + (random() * (ST_Ymax(ext) - ST_Ymin(ext))) y
      FROM r) f;
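
One caveat that is not from the original test: ST_Point creates geometries with no SRID set, so if the tract table was loaded with a defined SRID (TIGER shapefiles are typically NAD83, EPSG:4269), the point table needs the same SRID or the spatial predicates will refuse to compare them. A hypothetical adjustment, assuming PostGIS 2.x and SRID 4269:

-- assumption: tag the random points with the tract table's SRID (4269 here)
ALTER TABLE t
  ALTER COLUMN geom TYPE geometry(Point, 4269)
  USING ST_SetSRID(geom, 4269);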

Then I added a gid column like this:


ALTER TABLE t ADD COLUMN GID SERIAL;


Then running:


CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE ST_DWithin(imported_ct.the_geom, t.geom, 0);

takes about 62 seconds (compare that to your ArcGIS attempt with the same number of points). The result is a table connecting the points in my table t with the gid in the census tract table.


At that speed you will do 200 million points in about 34 hours (200,000,000 / 100,000 × 62 seconds ≈ 124,000 seconds). So, if it is enough to check just one of the points, my old laptop can do it with one core.


But if you need to check both points it might be harder.


Then you can manually distribute the load over more than one core by starting multiple sessions against the database and running different queries.


In my example, with 50,000 points per session and two CPU cores, I tried:



CREATE TABLE t1 AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE t.gid > 50000 AND ST_DWithin(imported_ct.the_geom, t.geom, 0);

in one db session, at the same time as running:


CREATE TABLE t2 AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE t.gid <= 50000 AND ST_DWithin(imported_ct.the_geom, t.geom, 0);

in another db session.


That took about 36 seconds, so it is a little bit slower per point than the first example, probably because both sessions were writing to disk at the same time. But since both cores work simultaneously, it didn't take more than 36 seconds of my time.


To union tables t1 and t2 I tried:



CREATE TABLE t3 AS 
SELECT * FROM t1
UNION ALL
SELECT * FROM t2;

which took about half a second.


So, with newer hardware and the load distributed over several cores, this should absolutely be possible, even if the real-world case will be slower than the test.


It might be worth noting that the example is from Linux (Ubuntu); using Windows would be another story. But I had all my other daily applications running, so the laptop was already quite heavily loaded. That might simulate the Windows case quite well, perhaps, without opening anything but pgAdmin.

