What is the precision of SELECT DISTINCT on PostGIS geometry column?

Sunday, 4 November 2018

What is the precision of SELECT DISTINCT on PostGIS geometry column?

I wonder what the precision of the SELECT DISTINCT operator is on a PostGIS geometry. On my system, the following query gives me a count of 5, which means that the inserted points are considered equal if they differ by less than 1e-5 and I am not sure if that is a feature of PostGIS, a problem of my installation or a bug.

Does anyone know if that is the expected behavior?

CREATE TEMP TABLE test (geom geometry);
INSERT INTO test
    VALUES 
        (St_GeomFromText('POINT (0.1 0.1)')),
        (St_GeomFromText('POINT (0.001 0.001)')),
        (St_GeomFromText('POINT (0.0001 0.0001)')),
        (St_GeomFromText('POINT (0.00001 0.00001)')),
        (St_GeomFromText('POINT (0.000001 0.000001)')),
        (St_GeomFromText('POINT (0.0000001 0.0000001)')),
        (St_GeomFromText('POINT (0.00000001 0.00000001)')),

        (St_GeomFromText('POINT (0.000000001 0.000000001)'));

SELECT COUNT(*) FROM (SELECT DISTINCT geom FROM test) AS test;

 count 
-------
     5
(1 row)

I am using:

$ psql --version
psql (PostgreSQL) 9.3.1

and

SELECT PostGIS_full_version();
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
POSTGIS="2.1.1 r12113" GEOS="3.4.2-CAPI-1.8.2 r3921" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.10.1, released 2013/08/26" LIBXML="2.7.3" LIBJSON="UNKNOWN" RASTER

on OSX 10.9

Answer

I'm surprised it's quite so coarse, but there it is. It's not DISTINCT, per se, it's the '=' operator, which is defined for geometry as 'equality of the index keys' which means practically 'equality of the 32-bit bounding boxes'.

You can see the same effect just using '=' directly,

select 'POINT (0.000000001 0.000000001)'::geometry = 'POINT (0.000000001 0.000001)'::geometry;

select 'POINT (0.000000001 0.000000001)'::geometry = 'POINT (0.000000001 0.00001)'::geometry;

Making '=' behave "intuitively" would unfortunately involve either a huge computational loss (doing explicit ST_Equals() evaluation for the operator call) or some substantial new complicated code (storing hash values for larger geometries, doing exact tests on the fly for smaller ones, choosing the right code path on the fly, etc)

And of course now lots of applications / users have internalized the existing behaviour, such as it is, so "improving" it would be a downgrade for many folks. You can do an "exact" distinct by calculating your set on ST_AsBinary(geom) instead, which will do exact equality testing on the bytea outputs.

Blog

Sunday, 4 November 2018

What is the precision of SELECT DISTINCT on PostGIS geometry column?

No comments:

Post a Comment

arcpy - Changing output name when exporting data driven pages to JPG?