Bad estimation for "where field not in"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

We have a table with about 60M records, almost all of which in one of
two statuses ('done', 'failed') and a few of them, usually < 1000, in
different transient statuses. We also have a partial index indexing
the transient items: where status not in ('done', 'failed'). Stats are
about right for the status field:

  stavalues1   |     stanumbers1
---------------+---------------------
 {done,failed} | {0.541767,0.458233}

Trying to access only the transient items leads to very bad records
estimations, and the choice of poor query plans, as the estimate is to
read some 25% of the items table (so joins are often performed against
full scans of other large tables instead of using indexes):

explain analyze select count(*) from items where status not in
('done', 'failed');

 Aggregate  (cost=2879903.86..2879903.87 rows=1 width=0) (actual
time=0.460..0.461 rows=1 loops=1)
   ->  Bitmap Heap Scan on items  (cost=3674.23..2843184.08
rows=14687908 width=0) (actual time=0.393..0.453 rows=20 loops=1)
         Recheck Cond: (((status)::text <> 'done'::text) AND
((status)::text <> 'failed'::text))
         ->  Bitmap Index Scan on i_items_transient_status
(cost=0.00..2.26 rows=14687908 width=0) (actual time=0.381..0.381
rows=38 loops=1)

Looking at these estimate of the rows in the table (59164756) and the
estimate of the filtered rows (14687908), looks like the planner is
calculating the probability of the status being neither done nor
failed as two events independent events:

=# select 59164555 * (1 - 0.541767) * (1 - 0.458233);
 14687927.231665933605

while it is clear (at least in the original query specification,
before splitting the condition in two ANDed parts) that the two events
are disjoint, so the probability should be calculated as 1 - (p(done)
+ p(failed)) instead of (1 - p(done)) * (1 - p(failed)).

Writing the query without the "not", listing the other statuses, leads
to a correct plan instead.

Is this a known planner shortcoming or something unexpected, to be
escalated to -bugs? Server version is 9.0.1.

-- Daniele

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux