Re: Very specialised query

Matthew Wakeling <matthew@xxxxxxxxxxx> · Mon, 30 Mar 2009 16:57:16 +0100 (BST)

Shouldn't Postgres favour a "between" index scan over an open-ended 
one?

On Fri, 27 Mar 2009, Tom Lane wrote:
Currently the planner only notices that for a range check that involves
comparisons of the same variable expression to two constants (or
pseudoconstants anyway).  In principle it might be reasonable to have a
heuristic that reduces the estimated selectivity in the example above,
but it looks to me like it'd make clauselist_selectivity() a lot slower
and more complicated.  When you see (l1.end > l2.start), how do you know
which variable to try to match up against others?  And if you try to
match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should not 
affect the selectivity much, because the separate constraints will all be 
throwing away the same set of rows. So having a single "greater than" will 
halve the number of rows, while two "greater than"s will divide the number 
of rows by three, etc. That's a vast approximation of course, and should 
be skewed by the statistics.

However, combining a "greater than" with a "less than" has a different 
effect on selectivity. If the numbers were completely random with 
identical value spread in each field, then a single "greater than" would 
halve the number of rows and an additional "less than" would halve the 
number again. However, in most real life situations the values will not be 
completely random, but will be very skewed, as in our case. Unless the 
planner starts collecting statistics on the standard distribution of the 
difference between two fields, that will be hard to account for though.

Have a look at the following EXPLAIN ANALYSE:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start < l2.start
UNION ALL 
SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start >= l2.start;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..20479179.74 rows=138732882 width=8)
         (actual time=0.055..2362748.298 rows=258210 loops=1)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.053..627.038 rows=99436 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
                         (cost=0.00..55966.07 rows=43277 width=12)
                         (actual time=0.025..58.604 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
                         (cost=0.00..123.10 rows=4809 width=12)
                         (actual time=0.003..0.006 rows=2 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.041..2361632.540 rows=158774 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
                         (cost=0.00..55966.07 rows=43277 width=12)
                         (actual time=0.009..80.814 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
                         (cost=0.00..123.10 rows=4809 width=12)
                         (actual time=0.012..29.823 rows=21768 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
 Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On 
one of them, there is a difference, and the other is open ended.

               Expected rows     Seen rows
difference       4809                  2
open-ended       4809              21768

The first query fragment takes 700ms to execute, and the second takes 
about forty minutes. It is effectively doing a nested loop join with 
hardly any benefit gained from the indexes at all, over a simple index on 
objectid. I may as well make the query a lot simpler, and do:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think 
this particular style of query is going to work for us. I know that the 
database could in theory return a result in about 1400ms. I'll see how 
close to that I can get with plpgsql.

Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance