Cardinality estimate of the inner relation

Frédéric Yhuel <frederic.yhuel@xxxxxxxxxx> · Fri, 22 Nov 2024 16:53:39 +0100

My colleague Christophe Courtois and I have been trying to fix a bad 
plan for one of Dalibo's clients. It is a (probably well-known) problem 
with skewed data and a parameterized Nested Loop with an underestimation 
of the cardinality of the inner relation.

Here is a test case (the script to create and populate the two tables is 
at the end):

EXPLAIN (ANALYZE, SETTINGS)
  SELECT name, sum(quantity)
  FROM products p
  JOIN orders o ON (p.id = o.product_id)
  WHERE p.name = 'babar'        /*    28% of orders    */
  GROUP BY name;

QUERY PLAN 

--------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=5.50..1022.27 rows=1 width=40) (actual 
time=66.213..66.215 rows=1 loops=1)
   ->  Nested Loop  (cost=5.50..1019.29 rows=1189 width=36) (actual 
time=0.083..56.991 rows=320000 loops=1)
         ->  Bitmap Heap Scan on products p  (cost=5.08..250.17 rows=85 
width=40) (actual time=0.058..0.153 rows=80 loops=1)
               Recheck Cond: (name = 'babar'::text)
               Heap Blocks: exact=80
               ->  Bitmap Index Scan on products_name_idx 
(cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1)
                     Index Cond: (name = 'babar'::text)
         ->  Index Scan using orders_product_id_idx on orders o 
(cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000 
loops=80)
               Index Cond: (product_id = p.id)
 Planning Time: 0.560 ms
 Execution Time: 66.268 ms

The cardinaly estimate on 'orders' is wrong (26 rows against 4000), 
hence the Nested Loop.
Any other value for product.name gives less rows and a good plan (i.e. 
NL makes more sense).

If I'm not wrong, the row count estimate for the inner table is computed 
as follow:

SELECT n_distinct AS product_id_n_distinct FROM pg_stats
  WHERE tablename = 'orders' AND attname = 'product_id' \gset

SELECT reltuples AS orders_tuples FROM pg_class
  WHERE relname = 'orders' \gset

SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation;
  inner_estimation
---------------------
 26.4049450290190157

I have two questions:

1) Is there a known workaround for this problem (besides unatural 
partitioning of the inner table)?

2) Would it be possible to add cross-table statistics to fix this? Has 
this been discussed?

Here is the script to create and populate the two tables:

 SELECT 80000 AS nbp \gset

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE UNLOGGED TABLE products(
        id bigint PRIMARY KEY,
        name TEXT,
        price INT
);

CREATE UNLOGGED TABLE orders(
        id bigint generated always as identity,
        product_id BIGINT, -- REFERENCES products(id),
        quantity INT
);

INSERT INTO products (id, name, price)
  SELECT
    i,
    CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END,
    random() * 1000
  FROM generate_series(1, :nbp) AS T(i);

INSERT INTO orders (product_id, quantity)
  SELECT product_id, quantity FROM (
    SELECT
      id AS product_id,
      random() * 10 AS quantity,
      name
    FROM products,
    LATERAL (SELECT * FROM generate_series (1, CASE WHEN name = 'babar' 
THEN 4000 ELSE 10 END)) T) U;

CREATE INDEX ON orders (product_id);
VACUUM ANALYZE products, orders;

ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id);

CREATE INDEX ON products (name);