Monday 4 February 2019

postgresql - How to properly set up indexes for PostGIS distance queries?


I'm building an application that is supposed to query and return every Record in a table that is X kilometers away from PointX. Records and PointX's positions are determined from (long/lat) information provided by Google Geocode API.


I'm new to PostGIS. After a quick research, I've found this question. The answer seems to be along the lines of:



SELECT *
FROM your_table
WHERE ST_Distance_Sphere(the_geom, ST_MakePoint(your_lon,your_lat)) <= radius_mi * 1609.34

The problem is: Even though I'm only getting started at GIS, when I look at the above query, I can't possibly imagine how this can use an index. There are 2 function calls. I imagine the table being scanned for every Record. I want to be wrong :)


Question: Does PostGIS have any index type capable of making the above query performant? If not, what would be the recommended approach doing what I need?



Answer



There are two keys to getting good geodetic query performance with large tables with geometry columns using WGS 1984 geographic data (SRID 4326):



  1. Use the ST_DWithin function, which searches using an available spatial index, and will find geography features with a Cartesian distance


  2. Build an extra index on the geography cast, so ST_DWithin can use it


So let's look at what happens in the real world. First we need to create and populate a table of one million random points:


DROP TABLE IF EXISTS example1
;

CREATE TABLE example1 (
idcol serial NOT NULL,
geomcol geometry NULL,
CONSTRAINT example1_pk PRIMARY KEY (idcol),

CONSTRAINT enforce_srid CHECK (st_srid(geomcol) = 4326)
)
with (
OIDS=FALSE
);

INSERT INTO example1(geomcol)
SELECT ST_SetSRID(
ST_MakePoint(
(random()*360.0) - 180.0,

(acos(1.0 - 2.0 * random()) * 2.0 - pi()) * 90.0 / pi()),
4326) as geomcol
FROM generate_series(1, 1000000) vtab;

CREATE INDEX example1_spx ON example1 USING GIST (geomcol);
-- (took about 22 sec)

If we execute the ST_Distance query, we get your expected full table scan:


EXPLAIN ANALYZE VERBOSE
SELECT count(*)

FROM example1
WHERE ST_Distance(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography) < 30 * 1609.34
;

Aggregate (cost=274167.33..274167.34 rows=1 width=0) (actual time=4940.531..4940.532 rows=1 loops=1)
Output: count(*)
-> Seq Scan on bob.example1 (cost=0.00..273334.00 rows=333333 width=0) (actual time=592.766..4940.509 rows=11 loops=1)
Output: idcol, geomcol
Filter: (_st_distance((example1.geomcol)::geography, '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography, 0::double precision, true) < 48280.2::double precision)
Rows Removed by Filter: 999989

Planning time: 2.137 ms
Execution time: 4940.568 ms

Now, if we use ST_DWithin, we still get a full table scan (albeit a faster one):


EXPLAIN ANALYZE VERBOSE
SELECT count(*)
FROM example1
WHERE ST_DWithin(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography,30 * 1609.34)
;


Aggregate (cost=405867.33..405867.34 rows=1 width=0) (actual time=908.716..908.716 rows=1 loops=1)
Output: count(*)
-> Seq Scan on bob.example1 (cost=0.00..405834.00 rows=13333 width=0) (actual time=38.449..908.700 rows=7 loops=1)
Output: idcol, geomcol
Filter: (((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography) AND ('0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography && _st_expand((example1.geomcol)::geography, 48280.2::double precision) (...)
Rows Removed by Filter: 999993
Planning time: 2.017 ms
Execution time: 908.763 ms

And this is the last piece -- Building the covering index (cast geography):



CREATE INDEX example1_gpx ON example1 USING GIST (geography(geomcol));
-- (Takes an extra 13 sec)

EXPLAIN ANALYZE VERBOSE
SELECT count(*)
FROM example1
WHERE ST_DWithin(geomcol::geography,ST_SetSRID(ST_MakePoint(6.9333,46.8167),4326)::geography,30 * 1609.34)
;

Aggregate (cost=96538.95..96538.96 rows=1 width=0) (actual time=0.775..0.775 rows=1 loops=1)

Output: count(*)
-> Bitmap Heap Scan on bob.example1 (cost=8671.62..96505.62 rows=13333 width=0) (actual time=0.586..0.769 rows=19 loops=1)
Output: idcol, geomcol
Recheck Cond: ((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography)
Filter: (('0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography && _st_expand((example1.geomcol)::geography, 48280.2::double precision)) AND _st_dwithin((example1.geomcol)::geography, '0101000020E61000005D6DC5FEB2BB1B40545227A089684740':: (...)
Rows Removed by Filter: 14
Heap Blocks: exact=33
-> Bitmap Index Scan on example1_gpx (cost=0.00..8668.29 rows=200000 width=0) (actual time=0.384..0.384 rows=33 loops=1)
Index Cond: ((example1.geomcol)::geography && '0101000020E61000005D6DC5FEB2BB1B40545227A089684740'::geography)
Planning time: 2.572 ms

Execution time: 0.820 ms

Finally, the optimizer is using the spatial index, and it shows, but what's three orders of magnitude between friends?


Some caveats:




  • I'm a database nerd, so my home PC has got 16Gb RAM, six 3.3Ghz cores, and a 256Gb SSD for the database default tablespace; your mileage may vary




  • I re-ran the creation SQL before each query, to level the playing field with respect to "hot" pages in the cache, but this could produce slightly different results because the same random seed wasn't used for different runs





And a note:



  • I tweaked the original {-90,+90} latitude range to use arc-cosine for an equal-area distribution (less biased toward the poles)


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...