Re: Unclamped row estimates whith OR-ed subplans

"Benjamin Coutu" <ben.coutu@xxxxxxxxx> · Fri, 19 Jun 2020 19:33:31 +0200

> No.  The subplan estimates are for the number of rows produced by one
> execution of the subplan, ie the numbers of "accounts" or "contracts"
> rows that match those inner WHERE conditions.  This has very little
> a-priori relationship to the number of "transactions" rows that will
> satisfy the outer WHERE condition.  If we knew that transactions.account
> and transactions.contract were unique columns, then we might be able
> to say that there shouldn't be more than one outer match per subplan
> result row ... but you didn't say that, and it seems unlikely.

Thanks Tom, I understand your point.

I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly distinct and the individual subplan estimates are quite accurate:

Seq Scan on transactions  (cost=67.81..171458.63 rows=1301316 width=1206) (actual time=69.418..917.594 rows=112 loops=1)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  Rows Removed by Filter: 1792937
  SubPlan 1
    ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=3.053..3.292 rows=111 loops=1)
          Recheck Cond: ((name)::text ~~* '%test%'::text)
          Rows Removed by Index Recheck: 4
          Heap Blocks: exact=104
          ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.505..0.505 rows=118 loops=1)
                Index Cond: ((name)::text ~~* '%test%'::text)
  SubPlan 2
    ->  Seq Scan on contracts  (cost=0.00..5.93 rows=5 width=4) (actual time=2.531..2.836 rows=4 loops=1)
          Filter: ((name)::text ~~* '%test%'::text)
          Rows Removed by Filter: 272

For comparison here are the plans for the queries with the individual where clauses:

SELECT * FROM "transactions" WHERE "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%')

Nested Loop  (cost=34.38..488.93 rows=155 width=1206) (actual time=0.599..1.393 rows=112 loops=1)
  ->  Bitmap Heap Scan on accounts  (cost=33.96..61.76 rows=46 width=4) (actual time=0.541..0.796 rows=111 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Rows Removed by Index Recheck: 4
        Heap Blocks: exact=104
        ->  Bitmap Index Scan on s_accounts  (cost=0.00..33.95 rows=46 width=0) (actual time=0.521..0.521 rows=118 loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_account on transactions  (cost=0.43..9.08 rows=21 width=1206) (actual time=0.004..0.005 rows=1 loops=111)
        Index Cond: (account = accounts."ID")

SELECT * FROM "transactions" WHERE "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

Nested Loop  (cost=3.76..10.10 rows=31662 width=1206) (actual time=0.082..0.082 rows=0 loops=1)
  ->  Bitmap Heap Scan on contracts  (cost=3.64..5.74 rows=5 width=4) (actual time=0.069..0.075 rows=4 loops=1)
        Recheck Cond: ((name)::text ~~* '%test%'::text)
        Heap Blocks: exact=2
        ->  Bitmap Index Scan on s_contracts  (cost=0.00..3.64 rows=5 width=0) (actual time=0.060..0.060 rows=4 loops=1)
              Index Cond: ((name)::text ~~* '%test%'::text)
  ->  Index Scan using fk_transactions_contract on transactions  (cost=0.12..0.86 rows=1 width=1206) (actual time=0.001..0.001 rows=0 loops=4)
        Index Cond: (contract = contracts."ID")

The statistics for the columns are:

SELECT attname, null_frac, n_distinct from pg_stats WHERE tablename = 'transactions' AND attname IN ('account', 'contract')

transactions.account:   null_frac=0.025 n_distinct=80277
transactions.contract:  null_frac=1     n_distinct=0     (there are basically no non-null values for field "contract" in transactions)

According to pg_class.reltuples the table "transactions" has 1735088 rows.

I'd naively expect the selectivity for an OR of those two hashed subplans given uniform distribution to be:

rows_total = 
	((rows_transactions * (1 - null_frac_account)) / n_distinct_account) * expected_rows_from_subplan1 +
	((rows_transactions * (1 - null_frac_contract)) / n_distinct_contract) * expected_rows_from_subplan2

=> rows_total = 
	((1735088 * (1 - 0.025)) / 80277) * 46 +
	((1735088 * (1 - 1)) / 0) * 5

=> rows_total = 969 + 0 /* no non-null values for contract field */

Please forgive the sloppy math but something along this line could be promising.

Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the null_frac of field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum estimate for that query?

Thanks again!

Benjamin Coutu