I think there's something suboptimal about the row estimate for generate_series. In the following, "documents" is a table with more than 120,000 rows, vacuumed and analyzed before the queries.

EXPLAIN ANALYZE
SELECT count(d.id), floor(s.val / 5000)
FROM generate_series(1::INT, 5009) AS s(val)
LEFT JOIN documents d ON d.id = s.val
GROUP BY 2
ORDER BY 2;

This returns:

 Sort  (cost=4231.52..4232.02 rows=200 width=8) (actual time=41.886..41.887 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=4219.88..4223.88 rows=200 width=8) (actual time=41.843..41.846 rows=2 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..4214.88 rows=1000 width=8) (actual time=1.274..38.193 rows=5009 loops=1)
               ->  Function Scan on generate_series s  (cost=0.00..12.50 rows=1000 width=4) (actual time=1.209..3.102 rows=5009 loops=1)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=5009)
                     Index Cond: (d.id = s.val)
 Total runtime: 42.218 ms

Note the function scan: the planner assumes the default 1000 rows, even though the call actually produces 5009, and the join is executed as 5009 individual index probes.

Now let's wrap generate_series in an SQL function:

CREATE FUNCTION genser(int, int) RETURNS SETOF int AS $$
    SELECT * FROM generate_series($1, $2) AS g(x);
$$ LANGUAGE sql;

EXPLAIN ANALYZE
SELECT count(d.id), floor(s.val / 5000)
FROM genser(1::INT, 5009) AS s(val)
LEFT JOIN documents d ON d.id = s.val
GROUP BY 2
ORDER BY 2;

Not surprisingly, this produces essentially the same plan, still with the default 1000-row estimate on the function scan:

 Sort  (cost=4479.02..4479.52 rows=200 width=8) (actual time=43.606..43.607 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=4467.38..4471.38 rows=200 width=8) (actual time=43.559..43.561 rows=2 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..4462.38 rows=1000 width=8) (actual time=3.564..39.740 rows=5009 loops=1)
               ->  Function Scan on genser s  (cost=0.00..260.00 rows=1000 width=4) (actual time=3.503..5.435 rows=5009 loops=1)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=5009)
                     Index Cond: (d.id = s.val)
 Total runtime: 44.047 ms
(9 rows)

But look what happens if we tell PostgreSQL how many rows "genser" will return:

CREATE OR REPLACE FUNCTION genser(int, int) RETURNS SETOF int AS $$
    SELECT * FROM generate_series($1, $2) AS g(x);
$$ LANGUAGE sql ROWS 5009;

EXPLAIN ANALYZE
SELECT count(d.id), floor(s.val / 5000)
FROM genser(1::INT, 5009) AS s(val)
LEFT JOIN documents d ON d.id = s.val
GROUP BY 2
ORDER BY 2;

Now we get a better plan:

 Sort  (cost=15545.54..15546.04 rows=200 width=8) (actual time=27.857..27.859 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=15533.89..15537.89 rows=200 width=8) (actual time=27.817..27.819 rows=2 loops=1)
         ->  Merge Right Join  (cost=1610.15..15508.85 rows=5009 width=8) (actual time=7.714..24.133 rows=5009 loops=1)
               Merge Cond: (d.id = s.val)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..13472.20 rows=125518 width=4) (actual time=0.045..6.112 rows=5010 loops=1)
               ->  Sort  (cost=1610.15..1622.67 rows=5009 width=4) (actual time=7.651..9.501 rows=5009 loops=1)
                     Sort Key: s.val
                     Sort Method:  quicksort  Memory: 427kB
                     ->  Function Scan on genser s  (cost=0.00..1302.34 rows=5009 width=4) (actual time=3.559..5.262 rows=5009 loops=1)
 Total runtime: 28.445 ms
(12 rows)

With an accurate estimate, the planner chooses a merge join instead of 5009 separate index probes, and the runtime drops from about 44 ms to 28 ms.

Since generate_series is a built-in function, and its arguments here are constants, can't the planner tell how many rows it will return?
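For what it's worth, the 1000-row guess seems to come straight from pg_proc.prorows, which defaults to 1000 for set-returning functions. A quick catalog query (a sketch, assuming the function names above) should show it:

-- prorows is the planner's per-call row estimate for a set-returning function
SELECT proname, prorows
FROM pg_proc
WHERE proname IN ('generate_series', 'genser');

After the ROWS 5009 clause above, genser should show prorows = 5009, while the generate_series overloads presumably all sit at the generic 1000.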
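Incidentally, if one doesn't want to re-create the wrapper function, the same estimate can apparently be attached after the fact, since ALTER FUNCTION accepts the same ROWS clause:

-- attach a row estimate without re-creating the function
ALTER FUNCTION genser(int, int) ROWS 5009;

Of course this only helps when the typical row count is known in advance: ROWS is a fixed constant, so the estimate can't depend on the actual arguments.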
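The only workaround I can think of for a bare generate_series call is to materialize the series so the planner sees real statistics; roughly like this (an untested sketch, where "series" is just a scratch name):

-- materialize the series so ANALYZE can gather real statistics
CREATE TEMP TABLE series AS
    SELECT val FROM generate_series(1, 5009) AS s(val);
ANALYZE series;

-- the join now runs against a table with a known row count
EXPLAIN ANALYZE
SELECT count(d.id), floor(s.val / 5000)
FROM series s
LEFT JOIN documents d ON d.id = s.val
GROUP BY 2
ORDER BY 2;

But that's clumsy compared to the planner simply knowing the answer, hence the question above.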