Thinking About Correlated Columns (again)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi!

This has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
  FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT *
  FROM foo
 WHERE col_a = 50
   AND col_b = 50;

Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
           (actual time=0.030..3.643 rows=1000 loops=1)
  Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular case doesn't suffer terribly from the mis-estimation, but others do. Boy, do they ever.

Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequence scans instead of bloated nested loops, for example), I recently tried to use some existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo  (cost=0.00..9.37 rows=2 width=12)
           (actual time=0.013..0.871 rows=1000 loops=1)
  Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the histogram is still the basis for the cost estimates. I'm curious what benefit overriding the n_distinct pg_stats field actually gives us in practice.

I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@xxxxxxxxxxxxxxxx

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance




[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux