Cardinality estimate of the inner relation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



My colleague Christophe Courtois and I have been trying to fix a bad plan for one of Dalibo's clients. It is a (probably well-known) problem with skewed data and a parameterized Nested Loop with an underestimation of the cardinality of the inner relation.

Here is a test case (the script to create and populate the two tables is at the end):

EXPLAIN (ANALYZE, SETTINGS)
  SELECT name, sum(quantity)
  FROM products p
  JOIN orders o ON (p.id = o.product_id)
  WHERE p.name = 'babar'        /*    28% of orders    */
  GROUP BY name;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=5.50..1022.27 rows=1 width=40) (actual time=66.213..66.215 rows=1 loops=1) -> Nested Loop (cost=5.50..1019.29 rows=1189 width=36) (actual time=0.083..56.991 rows=320000 loops=1) -> Bitmap Heap Scan on products p (cost=5.08..250.17 rows=85 width=40) (actual time=0.058..0.153 rows=80 loops=1)
               Recheck Cond: (name = 'babar'::text)
               Heap Blocks: exact=80
-> Bitmap Index Scan on products_name_idx (cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1)
                     Index Cond: (name = 'babar'::text)
-> Index Scan using orders_product_id_idx on orders o (cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000 loops=80)
               Index Cond: (product_id = p.id)
 Planning Time: 0.560 ms
 Execution Time: 66.268 ms

The cardinaly estimate on 'orders' is wrong (26 rows against 4000), hence the Nested Loop. Any other value for product.name gives less rows and a good plan (i.e. NL makes more sense).

If I'm not wrong, the row count estimate for the inner table is computed as follow:

SELECT n_distinct AS product_id_n_distinct FROM pg_stats
  WHERE tablename = 'orders' AND attname = 'product_id' \gset

SELECT reltuples AS orders_tuples FROM pg_class
  WHERE relname = 'orders' \gset

SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation;
  inner_estimation
---------------------
 26.4049450290190157

I have two questions:

1) Is there a known workaround for this problem (besides unatural partitioning of the inner table)?

2) Would it be possible to add cross-table statistics to fix this? Has this been discussed?


Here is the script to create and populate the two tables:

 SELECT 80000 AS nbp \gset

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE UNLOGGED TABLE products(
        id bigint PRIMARY KEY,
        name TEXT,
        price INT
);

CREATE UNLOGGED TABLE orders(
        id bigint generated always as identity,
        product_id BIGINT, -- REFERENCES products(id),
        quantity INT
);

INSERT INTO products (id, name, price)
  SELECT
    i,
    CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END,
    random() * 1000
  FROM generate_series(1, :nbp) AS T(i);

INSERT INTO orders (product_id, quantity)
  SELECT product_id, quantity FROM (
    SELECT
      id AS product_id,
      random() * 10 AS quantity,
      name
    FROM products,
LATERAL (SELECT * FROM generate_series (1, CASE WHEN name = 'babar' THEN 4000 ELSE 10 END)) T) U;

CREATE INDEX ON orders (product_id);
VACUUM ANALYZE products, orders;

ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id);

CREATE INDEX ON products (name);





[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux