Wednesday 27 June 2018

Difference between JOIN ON and WHERE in Virtual Layers in QGIS


There is already a question on this topic Difference between SpatialJoin with “…where…” and “join…on…”. However, it is dedicated to PostGIS.


Since in the Virtual Layer query window, I cannot simply proceed with EXPLAIN ANALYZE command before my SELECT, therefore cannot see the difference between two queries and moreover it uses SQLite and Spatialite to operate.


So, what is the difference between those queries in terms of Virtual Layers and QGIS?


SELECT d.id, COUNT(p.id) PoInPol

FROM districts AS d, points AS p
WHERE st_within(p.geometry, d.geometry)
GROUP BY d.id

Versus


SELECT d.id, COUNT(p.id) PoInPol
FROM districts AS d
JOIN points AS p ON st_within(p.geometry, d.geometry)
GROUP BY d.id


Those queries simply give a number of points inside each feature of districts layer.



Answer



DB Query Planners are (usually) smart and will rewrite the query to make it the most efficient. BUT, there is still a difference between the two queries (especially if we ignore the planner intervention).


For the example in the question, it makes no difference. If you use LEFT/RIGHT/OUTER JOINs, it does. If you add extra conditions, it also makes a difference. And if you have several joins, it can have a great impact, both in terms of results and efficiency.


When the filtering condition is put inside the JOIN, it is evaluated right away and only rows satisfying the condition are used in the next join.


When the filtering condition is put in the WHERE clause, it is evaluated at the end, so all rows are considered.


SELECT a.id
FROM a,b,c
WHERE a.id = b.id AND b.id = c.id
AND a.val=1 AND b.val=2 AND c.val=3;


--> do a CROSS JOIN between a,b,c. From all rows (so a size * b size * c size), keep the ones satisfying the condition.


SELECT a.id
FROM a
JOIN B ON a.id = b.id AND a.val = 1 AND b.val = 2
JOIN C ON b.id = c.id AND c.val = 3;

--> get all rows from A. Keep the rows having a.val=1. Match rows in B by id and keep only the rows having b.val=2. Using this partial result set, match rows in C by id and keep the rows having c.val=3


Using a LEFT JOIN, the difference is in the result. Suppose we have an entry in table A with no match (by ID) in table B.


SELECT * 

FROM A
LEFT JOIN B ON a.id = b.id
WHERE b.val =2;

--> the row that exists only in A is kept in the join. The WHERE clause filters it out.


SELECT * 
FROM A
LEFT JOIN B ON a.id = b.id AND b.val = 2;

--> There is no row in B matching the row ID and b.val, so the right side of the join is NULL. Since there is a row on the left side of the join, the row is returned (A.* is populated, B.* is null)



No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?

Is there a way to save the output JPG, changing the output file name to the page name, instead of page number? I mean changing the script fo...