Search Postgresql Archives

Using an index for IS DISTINCT FROM queries

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi everyone,
I have a large table (~150M rows) that keeps a version field.  At any given time, it is expected that the vast majority of the rows are on the "current" version, but some may be missing.

To figure out all the missing our outdated values, I run a query along the lines of 

SELECT id FROM source_table LEFT OUTER JOIN dest_table WHERE version IS DISTINCT FROM <current-version>

However, this query always selects a sequential scan and hash of both tables, which is *very* slow.

The statistics reflect that the planner knows that current-version is overwhelmingly common:

 schemaname | tablename |     attname      | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+------------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public     | event     | unpacker_version | f         |         0 |         4 |          1 | {1}              | {1}               |                  |           1

but it doesn't help:

event=> explain select count(1) from event where unpacker_version is distinct from 1;
                            QUERY PLAN                            
------------------------------------------------------------------
 Aggregate  (cost=10658841.93..10658841.94 rows=1 width=0)
   ->  Seq Scan on event  (cost=0.00..10658841.93 rows=1 width=0)
         Filter: (unpacker_version IS DISTINCT FROM 1)
(3 rows)


I can "by hand" force the planner to consider the obvious solution (find values below, above, and null) but it is ugly:

event=> explain select count(1) from event where unpacker_version < 1 or unpacker_version > 1 or unpacker_version is null;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Aggregate  (cost=139.14..139.15 rows=1 width=0)
   ->  Bitmap Heap Scan on event  (cost=135.13..139.14 rows=1 width=0)
         Recheck Cond: ((unpacker_version < 1) OR (unpacker_version > 1) OR (unpacker_version IS NULL))
         ->  BitmapOr  (cost=135.13..135.13 rows=1 width=0)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version < 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version > 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version IS NULL)
(10 rows)

but this sucks to do such tuning by munging the query.  Is there some case where these are not equivalent?  If they are equivalent, would this be a reasonable case for the planner to consider and optimize by itself?

Thanks,
Steven

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux