Hi,

I've been fighting with some CTE queries recently, and I've ended up with two basic cases. In one case the CTEs work absolutely great, making the estimates much more precise, while in the other the results are pretty terrible. And I'm not sure why either of them behaves the way it does. I'm aware of the "optimization fence" and I suspect it's the cause here, but I wonder why in one case it improves the estimates so much while in the other it performs so badly. If anyone could explain what's happening, it'd be great.

The application is using a 'calendar table', with one row for each day between the years 1900 and 2050 (i.e. ~55000 rows) and some additional columns describing the date (month, quarter, week, ...). All of these are sequences of numbers starting at 1900-01-01, so the table looks like this:

  id | month | week
 -------------------
   1 |     0 |    0
   2 |     0 |    0
 .....
   7 |     0 |    1    <- switch to 2nd week
 .....
  31 |     0 |
  32 |     1 |         <- switch to February 1900

and so on. Let's go with 'month' only, so the table looks like this:

CREATE TABLE calendar_table (
    id    INT PRIMARY KEY,
    month INT
);

and let's fill it with data - for simplicity let's suppose all months have 30 days:

INSERT INTO calendar_table (id, month)
SELECT i, (i/30) FROM generate_series(1,55000) s(i);

The plans below also rely on an index on the month column:

CREATE INDEX calendar_table_idx ON calendar_table (month);

The second table is a "fact" table containing a primary key, a date (a FK to the calendar table) and some additional data (facts), but we don't need those here, so let's use just this:

CREATE TABLE fact_table (
    id      INT PRIMARY KEY,
    date_id INT REFERENCES calendar_table (id)
);

Now let's fill the fact table with data for a very short period of time, e.g. one month. Let's find the min/max days for month 1000 (which is ~1983):

SELECT min(id), max(id) FROM calendar_table WHERE month = 1000;

  min  |  max
-------+-------
 30000 | 30029

and let's fill in some random data (1000000 rows):

INSERT INTO fact_table (id, date_id)
SELECT i, floor(30000 + random()*30)
FROM generate_series(1,1000000) s(i);

Analyze both tables (ANALYZE calendar_table; ANALYZE fact_table;) and we're ready for the two queries.

(A) significant improvement of estimates / performance

A usual query is "do something with the data for month X", so let's select all the data from the fact table that join to month 1000:

SELECT * FROM fact_table f JOIN calendar_table c ON (date_id = c.id)
WHERE month = 1000;

We know it should return 1000000 rows, but the explain plan shows an estimate of just 527 (http://explain.depesz.com/s/Q4oy):

                            QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=9.18..23110.45 rows=527 width=49)
   Hash Cond: (f.date_id = c.id)
   ->  Seq Scan on fact_table f  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Hash  (cost=8.82..8.82 rows=29 width=8)
         ->  Index Scan using calendar_table_idx on calendar_table c  (cost=0.00..8.82 rows=29 width=8)
               Index Cond: (month = 1000)
(6 rows)

Now, I'm pretty sure I understand where the estimate comes from. The month is ~0.053% of the calendar table (29 of ~55000 days, per the planner's estimate), and by applying that fraction to the 1000000 rows of the fact table we get ~527 rows. That's almost exactly the estimate.
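A quick back-of-the-envelope check of that arithmetic (plain SQL, using the rows=29 value from the plan above and the ~55000-row table size):

SELECT round(29.0 / 55000 * 1000000) AS estimated_rows;

 estimated_rows
----------------
            527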
Now, let's move the calendar table to a CTE (incl. the condition):

WITH c AS (SELECT * FROM calendar_table WHERE month = 1000)
SELECT * FROM fact_table f JOIN c ON (f.date_id = c.id);

This gives us the following plan (http://explain.depesz.com/s/5k9):

                            QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=9.76..32772.43 rows=966667 width=49)
   Hash Cond: (f.date_id = c.id)
   CTE c
     ->  Index Scan using calendar_table_idx on calendar_table  (cost=0.00..8.82 rows=29 width=8)
           Index Cond: (month = 1000)
   ->  Seq Scan on fact_table f  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Hash  (cost=0.58..0.58 rows=29 width=8)
         ->  CTE Scan on c  (cost=0.00..0.58 rows=29 width=8)
(8 rows)

This is a much better estimate (966667 rows, almost exactly the actual number). The real queries are usually more complex (more joins, aggregations etc.), and this precise estimate significantly improves their performance. I don't get it - how could the planner arrive at so much more precise an estimate? I'd expect such behavior from e.g. temporary tables (because they may be analyzed), but that's not how CTEs work internally - AFAIK the materialization does not allow gathering stats prior to planning.

Now let's see the opposite direction.

(B) significant deviation of estimates / performance

Let's work with the fact table only - we've seen queries where narrow "noodles" of the fact table (basically the PK plus one column) were selected and then joined back together using the PK. Yes, it's a dumb thing to do, but that's not the point. Let's see how this performs with CTEs:

WITH a AS (SELECT * FROM fact_table),
     b AS (SELECT * FROM fact_table)
SELECT * FROM a JOIN b ON (a.id = b.id);

The explain plan is here (http://explain.depesz.com/s/zGy) - notice the estimate of 5000000000 rows, overestimating the actual number (1000000, because we're joining over a PK) by a factor of ~5000:

                             QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=278007.69..75283007.69 rows=5000000000 width=80)
   Merge Cond: (a.id = b.id)
   CTE a
     ->  Seq Scan on fact_table  (cost=0.00..19346.00 rows=1000000 ...)
   CTE b
     ->  Seq Scan on fact_table  (cost=0.00..19346.00 rows=1000000 ...)
   ->  Sort  (cost=119657.84..122157.84 rows=1000000 width=40)
         Sort Key: a.id
         ->  CTE Scan on a  (cost=0.00..20000.00 rows=1000000 width=40)
   ->  Sort  (cost=119657.84..122157.84 rows=1000000 width=40)
         Sort Key: b.id
         ->  CTE Scan on b  (cost=0.00..20000.00 rows=1000000 width=40)
(12 rows)

Now let's try it without the CTEs:

SELECT * FROM (SELECT * FROM fact_table) a
         JOIN (SELECT * FROM fact_table) b ON (a.id = b.id);

That gives us this plan (http://explain.depesz.com/s/Fmv):

                             QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=0.00..85660.70 rows=1000000 width=82)
   Merge Cond: (public.fact_table.id = public.fact_table.id)
   ->  Index Scan using fact_table_pkey on fact_table  (cost=0.00..35330.35 rows=1000000 width=41)
   ->  Index Scan using fact_table_pkey on fact_table  (cost=0.00..35330.35 rows=1000000 width=41)
(4 rows)

Now we're getting a much better estimate - exactly the right number of rows. Both queries might be improved (especially the second one, which may be rewritten as a plain select without a join - see the PS below), but that's not the point here. Why does the CTE improve the estimates so much in the first example, yet hurt them so badly in the second one?

thanks
Tomas
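PS: By "rewritten as a plain select without a join" I mean something like the sketch below - since the join is over the PK, every row matches exactly itself, so the two "noodles" can be glued back together without any join at all (the b_* aliases are just illustrative):

SELECT id, date_id, id AS b_id, date_id AS b_date_id FROM fact_table;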