Re: "NOT IN" substantially slower in 9.0.2 than 8.3.13 - NOT EXISTS runs fast in both 8.3.13 and 9.0.2

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 21 Jan 2011 14:26:46 -0500

Robert Haas <robertmhaas@xxxxxxxxx> writes:
> On Thu, Jan 20, 2011 at 2:05 AM, Achilleas Mantzios
> <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>     ->  Hash Semi Join  (cost=2768.00..5671.67 rows=1 width=12) (actual time=39.249..81.025 rows=1876 loops=1)
>>           Hash Cond: (msold.marinerid = msold2.marinerid)
>>           Join Filter: ((msold2.id <> msold.id) AND (msold2.starttime < msold.starttime) AND ((msold.starttime - msold2.endtime) <= '1 year 6 mons'::interval))

> Looks like the bad selectivity estimate there is what's killing it.
> Not sure I completely understand why 9.0.2 is coming up with such a
> bad estimate, though.

Hm ... it's the <> clause.  Look at this, in the regression database:

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.hundred = b.hundred);
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..1134.65 rows=10000 width=244) (actual time=0.362..960.732 rows=10000 loops=1)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.070..45.287 rows=10000 loops=1)
   ->  Index Scan using tenk1_hundred on tenk1 b  (cost=0.00..2.16 rows=100 width=4) (actual time=0.073..0.073 rows=1 loops=10000)
         Index Cond: (hundred = a.hundred)
 Total runtime: 996.990 ms
(5 rows)

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.hundred = b.hundred and a.thousand <> b.thousand);
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=583.00..1078.50 rows=1 width=244) (actual time=142.738..344.823 rows=10000 loops=1)
   Hash Cond: (a.hundred = b.hundred)
   Join Filter: (a.thousand <> b.thousand)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.051..44.137 rows=10000 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=10000 width=8) (actual time=142.526..142.526 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 313kB
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=8) (actual time=0.027..71.778 rows=10000 loops=1)
 Total runtime: 384.017 ms
(8 rows)

(This is with enable_hashagg off, to make the two plans more obviously
comparable; but that's cosmetic.  The important point is that the join
rowcount estimate is dead on in the first case and dead wrong in the
second.)

Some digging turns up the fact that the semi-join selectivity of
"a.thousand <> b.thousand" is being estimated at *zero*.  This is
because the semi-join selectivity of "a.thousand = b.thousand" is
estimated at 1.0 (correctly: every row of a has at least one join
partner in b).  And then neqjoinsel is computed as 1 - eqjoinsel,
which is a false conclusion for semijoins: joining to at least one row
doesn't mean joining to every row.

I'm a bit inclined to fix this by having neqjoinsel hard-wire a result
of 1 for semi and anti join cases --- that is, assume there's always
at least one inner row that isn't equal to the outer row.  That's
presumably too high for real-world cases where the clause is probably
being used together with other, correlated, clauses; but we've got no
info available that would help narrow that down.  The best we can do
here is a forced estimate.  If it should be less than 1, then what?

			regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance