Re: Very specialised query

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Shouldn't Postgres favour a "between" index scan over an open-ended one?

On Fri, 27 Mar 2009, Tom Lane wrote:
Currently the planner only notices that for a range check that involves
comparisons of the same variable expression to two constants (or
pseudoconstants anyway).  In principle it might be reasonable to have a
heuristic that reduces the estimated selectivity in the example above,
but it looks to me like it'd make clauselist_selectivity() a lot slower
and more complicated.  When you see (l1.end > l2.start), how do you know
which variable to try to match up against others?  And if you try to
match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should not affect the selectivity much, because the separate constraints will all be throwing away the same set of rows. So having a single "greater than" will halve the number of rows, while two "greater than"s will divide the number of rows by three, etc. That's a vast approximation of course, and should be skewed by the statistics.

However, combining a "greater than" with a "less than" has a different effect on selectivity. If the numbers were completely random with identical value spread in each field, then a single "greater than" would halve the number of rows and an additional "less than" would halve the number again. However, in most real life situations the values will not be completely random, but will be very skewed, as in our case. Unless the planner starts collecting statistics on the standard distribution of the difference between two fields, that will be hard to account for though.

Have a look at the following EXPLAIN ANALYSE:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start < l2.start
UNION ALL SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start >= l2.start;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..20479179.74 rows=138732882 width=8)
         (actual time=0.055..2362748.298 rows=258210 loops=1)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.053..627.038 rows=99436 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
                         (cost=0.00..55966.07 rows=43277 width=12)
                         (actual time=0.025..58.604 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
                         (cost=0.00..123.10 rows=4809 width=12)
                         (actual time=0.003..0.006 rows=2 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.041..2361632.540 rows=158774 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
                         (cost=0.00..55966.07 rows=43277 width=12)
                         (actual time=0.009..80.814 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
                         (cost=0.00..123.10 rows=4809 width=12)
                         (actual time=0.012..29.823 rows=21768 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
 Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On one of them, there is a difference, and the other is open ended.

               Expected rows     Seen rows
difference       4809                  2
open-ended       4809              21768

The first query fragment takes 700ms to execute, and the second takes about forty minutes. It is effectively doing a nested loop join with hardly any benefit gained from the indexes at all, over a simple index on objectid. I may as well make the query a lot simpler, and do:

SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE
        l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think this particular style of query is going to work for us. I know that the database could in theory return a result in about 1400ms. I'll see how close to that I can get with plpgsql.

Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux