Re: Unclamped row estimates whith OR-ed subplans

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 19 Jun 2020 14:16:49 -0400

"Benjamin Coutu" <ben.coutu@xxxxxxxxx> writes:
> I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly distinct and the individual subplan estimates are quite accurate:

Yeah, as I said, the estimates you're getting for the OR'd subplans are
pretty stupid.  Once you throw the OR in there, it's not possible to
convert the IN clauses to semi-joins, so they just stay as generic
subplans.  It looks like we have exactly zero intelligence about the
generic case --- unless I'm missing something in clause_selectivity,
you just end up with a default 0.5 selectivity estimate.  So yeah,
there's a lot of room for improvement, whenever anyone finds some
round tuits to work on that.

While you're waiting, you might think about recasting the query to
avoid the OR.  Perhaps you could do a UNION of two scans of the
transactions table?

> Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the null_frac of field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum estimate for that query?

That I don't understand.  I get a minimal rowcount estimate for an
all-nulls outer table, as long as I'm using just one IN rather than
an OR:

regression=# create table contracts (id int);
CREATE TABLE
regression=# insert into contracts values(1),(2),(3),(4);
INSERT 0 4
regression=# analyze contracts ;
ANALYZE
regression=# create table transactions (contract int);
CREATE TABLE
regression=# insert into transactions select null from generate_series(1,100000);
INSERT 0 100000
regression=# analyze transactions;
ANALYZE
regression=# explain select * from transactions where contract in (select id from contracts);
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Hash Semi Join  (cost=1.09..1607.59 rows=1 width=4)
   Hash Cond: (transactions.contract = contracts.id)
   ->  Seq Scan on transactions  (cost=0.00..1344.00 rows=100000 width=4)
   ->  Hash  (cost=1.04..1.04 rows=4 width=4)
         ->  Seq Scan on contracts  (cost=0.00..1.04 rows=4 width=4)
(5 rows)

			regards, tom lane