Robert Haas <robertmhaas@xxxxxxxxx> writes:
> On Thu, Jan 20, 2011 at 2:05 AM, Achilleas Mantzios
> <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
>> ->  Hash Semi Join  (cost=2768.00..5671.67 rows=1 width=12) (actual time=39.249..81.025 rows=1876 loops=1)
>>       Hash Cond: (msold.marinerid = msold2.marinerid)
>>       Join Filter: ((msold2.id <> msold.id) AND (msold2.starttime < msold.starttime) AND ((msold.starttime - msold2.endtime) <= '1 year 6 mons'::interval))

> Looks like the bad selectivity estimate there is what's killing it.
> Not sure I completely understand why 9.0.2 is coming up with such a
> bad estimate, though.

Hm ... it's the <> clause.  Look at this, in the regression database:

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.hundred = b.hundred);
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.00..1134.65 rows=10000 width=244) (actual time=0.362..960.732 rows=10000 loops=1)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.070..45.287 rows=10000 loops=1)
   ->  Index Scan using tenk1_hundred on tenk1 b  (cost=0.00..2.16 rows=100 width=4) (actual time=0.073..0.073 rows=1 loops=10000)
         Index Cond: (hundred = a.hundred)
 Total runtime: 996.990 ms
(5 rows)

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.hundred = b.hundred and a.thousand <> b.thousand);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=583.00..1078.50 rows=1 width=244) (actual time=142.738..344.823 rows=10000 loops=1)
   Hash Cond: (a.hundred = b.hundred)
   Join Filter: (a.thousand <> b.thousand)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.051..44.137 rows=10000 loops=1)
   ->  Hash  (cost=458.00..458.00 rows=10000 width=8) (actual time=142.526..142.526 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 313kB
         ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=8) (actual time=0.027..71.778 rows=10000 loops=1)
 Total runtime: 384.017 ms
(8 rows)

(This is with enable_hashagg off, to make the two plans more obviously
comparable; but that's cosmetic.  The important point is that the join
rowcount estimate is dead on in the first case and dead wrong in the
second.)

Some digging turns up the fact that the semi-join selectivity of
"a.thousand <> b.thousand" is being estimated at *zero*.  This is
because the semi-join selectivity of "a.thousand = b.thousand" is
estimated at 1.0 (correctly: every row of a has at least one join
partner in b), and then neqjoinsel is computed as 1 - eqjoinsel, which
is a false conclusion for semijoins: joining to at least one row
doesn't mean joining to every row.

I'm a bit inclined to fix this by having neqjoinsel hard-wire a result
of 1 for semi and anti join cases --- that is, assume there's always at
least one inner row that isn't equal to the outer row.  That's
presumably too high for real-world cases, where the clause is probably
being used together with other, correlated, clauses; but we've got no
info available that would help narrow that down.  The best we can do
here is a forced estimate; if it ought to be something less than 1,
what would we base that on?

			regards, tom lane
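For concreteness, here is a minimal sketch of what that hard-wiring
might look like in neqjoinsel(), assuming the 9.0-era shape of the
function in src/backend/utils/adt/selfuncs.c.  The declarations and the
get_negator()/eqjoinsel() plumbing are reconstructed from that general
shape rather than quoted from the tree, so read it as an illustration
of the idea, not a tested patch:

Datum
neqjoinsel(PG_FUNCTION_ARGS)
{
	PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
	Oid			operator = PG_GETARG_OID(1);
	List	   *args = (List *) PG_GETARG_POINTER(2);
	JoinType	jointype = (JoinType) PG_GETARG_INT16(3);
	SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
	float8		result;

	if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
	{
		/*
		 * For semi/anti joins, 1 - eqjoinsel is bogus: the equality
		 * clause being near-certain to find *some* matching inner row
		 * does not mean there is no non-matching one.  Assume there is
		 * always at least one inner row unequal to the outer row.
		 */
		result = 1.0;
	}
	else
	{
		/* As now: estimate the negator (equality) clause and invert it. */
		Oid			eqop = get_negator(operator);

		if (eqop)
			result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
												PointerGetDatum(root),
												ObjectIdGetDatum(eqop),
												PointerGetDatum(args),
												Int16GetDatum(jointype),
												PointerGetDatum(sjinfo)));
		else
			result = DEFAULT_EQ_SEL;	/* no negator available */

		result = 1.0 - result;
	}

	PG_RETURN_FLOAT8(result);
}

With a change along these lines, the second tenk1 plan's estimate
should come out the same as the first one's, since the <> clause would
contribute a selectivity of 1 rather than 0, and the Hash Semi Join in
the original query should get a much less pessimistic row estimate.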