Thursday 28 April 2016

postgis - Performance Gain Through GIST Index for Point in Polygon Query


I have two tables: locations(id,region_id,the_geom) and regions(id,the_geom). For each location point I want to determine the region it is located in:


UPDATE locations SET region_id = 
(SELECT id FROM regions

WHERE ST_Within(locations.the_geom,regions.the_geom)
);

Does it make sense to build a GIST index on the location points? I'll build an index on the region polygons but I'm unsure about the points. Would it speed up the query?



Answer



Short answer: No. With this type of UPDATE query, we are updating each row in locations ("Seq Scan"), and the GiST index on the_geom in regions is sufficient in helping limit rows for the ST_Within condition to pair-up the right row from regions.




Longer answer: The magic to figuring this out is to compare what you get from explain query. From pgAdmin III, there is an "Explain query" button at the top of a query editor, or from pgsql, just prefix your query with "explain":


postgis=# explain UPDATE locations SET region_id =
postgis-# (SELECT id FROM regions

postgis(# WHERE ST_Within(locations.the_geom, regions.the_geom)
postgis(# );
QUERY PLAN
--------------------------------------------------------------------------------------------
Seq Scan on locations (cost=0.00..8755.54 rows=1000 width=110)
SubPlan 1
-> Index Scan using regions_gist_the_geom on regions (cost=0.00..8.52 rows=1 width=4)
Index Cond: ($0 && the_geom)
Filter: _st_within($0, the_geom)
(5 rows)


You don't need to understand everything that is coughed up here. The key thing to see here is in the inner-most part (SubPlan 1) it indicates "Index" (= uses an index, which could speed things up considerably), and not "Seq Scan" (= sequence scan, i.e. checking each row to see if it is within, which can be slower). If you add/delete a GiST index on locations, the output of this explain query is exactly the same, so the query performance should be the same.


However, if you do something silly, and remove your GiST index from regions, you see a different query plan from the same query as above:


                             QUERY PLAN
---------------------------------------------------------------------
Seq Scan on locations (cost=0.00..74288.00 rows=1000 width=110)
SubPlan 1
-> Seq Scan on regions (cost=0.00..74.05 rows=1 width=4)
Filter: (($0 && the_geom) AND _st_within($0, the_geom))
(4 rows)


The important thing to see between the two explain queries is the maximum cost estimates .. contrast 74.05 here to 8.52 before, so you'd expect this query to be slower.


No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...