The problem was described here earlier, but unfortunately there were no answers: http://www.postgresql.org/message-id/1387780366.910542228@xxxxxxxxxxxxxx
It looks like a defect.
CREATE TABLE t1 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t1 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t1;
ANALYZE t1;
CREATE TABLE t2 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t2 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t2;
ANALYZE t2;
CREATE TEMP TABLE t3 (ref_id INTEGER);
INSERT INTO t3 (ref_id) VALUES (333333), (666666);
ANALYZE t3;
test=> EXPLAIN (ANALYZE) SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..34.84 rows=20000 width=12) (actual time=0.046..0.104 rows=4 loops=1)
-> Seq Scan on t3 (cost=0.00..1.02 rows=2 width=4) (actual time=0.008..0.009 rows=2 loops=1)
-> Append (cost=0.42..16.89 rows=2 width=8) (actual time=0.023..0.042 rows=2 loops=2)
-> Index Scan using t1_pkey on t1 (cost=0.42..8.45 rows=1 width=8) (actual time=0.020..0.022 rows=1 loops=2)
Index Cond: (id = t3.ref_id)
-> Index Scan using t2_pkey on t2 (cost=0.42..8.45 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=2)
Index Cond: (id = t3.ref_id)
Total runtime: 0.184 ms
(8 rows)
This plan is perfect, but the row estimate is not: 20000 vs. 4.
As far as I can see, Pg is able to produce correct row estimates for the inputs: the inner Append has rows=2 and the outer seq scan has rows=2, so the nested loop should know it can produce at most 2 * 2 = 4 rows.
Moreover, the cost estimate is _correct_! It corresponds to 'rows=4'.
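As a quick sanity check (my addition, not part of the original report), the actual row count of that join can be confirmed directly; given the setup above, each of the two ref_id values matches exactly one row in t1 and one in t2:

```sql
-- Each ref_id (333333, 666666) matches once in t1 and once in t2,
-- so the join produces 2 * 2 = 4 rows.
SELECT count(*)
FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t
INNER JOIN t3 ON t3.ref_id = t.id;
-- count = 4
```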
Why is a correct row estimate important? Because it matters in a more complex query.
Let's join another big table into that query:
CREATE TABLE links (c1 INTEGER PRIMARY KEY, descr TEXT);
INSERT INTO links (c1, descr) SELECT b, '2' FROM generate_series(1, 100000) a (b);
REINDEX TABLE links;
ANALYZE links;
test=> EXPLAIN (ANALYZE) SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id INNER JOIN links l ON (t.c1 = l.c1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=2693.43..3127.84 rows=20000 width=18) (actual time=33.619..33.619 rows=0 loops=1)
Hash Cond: (t1.c1 = l.c1)
-> Nested Loop (cost=0.42..34.84 rows=20000 width=12) (actual time=0.038..0.078 rows=4 loops=1)
-> Seq Scan on t3 (cost=0.00..1.02 rows=2 width=4) (actual time=0.006..0.007 rows=2 loops=1)
-> Append (cost=0.42..16.89 rows=2 width=8) (actual time=0.017..0.029 rows=2 loops=2)
-> Index Scan using t1_pkey on t1 (cost=0.42..8.45 rows=1 width=8) (actual time=0.015..0.017 rows=1 loops=2)
Index Cond: (id = t3.ref_id)
-> Index Scan using t2_pkey on t2 (cost=0.42..8.45 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=2)
Index Cond: (id = t3.ref_id)
-> Hash (cost=1443.00..1443.00 rows=100000 width=6) (actual time=33.479..33.479 rows=100000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 3711kB
-> Seq Scan on links l (cost=0.00..1443.00 rows=100000 width=6) (actual time=0.017..14.853 rows=100000 loops=1)
Total runtime: 33.716 ms
(13 rows)
The planner thinks there will be 20000 rows from the join between "t" and "t3", and that is why it decides to use a hash join with the "links" table.
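As a diagnostic (my addition, not from the original report), one can temporarily disable hash joins for the session to confirm that the cheap nested-loop-plus-index-scan plan on "links" is available to the planner; `enable_hashjoin` is a standard planner GUC, and this is a debugging aid rather than a production fix:

```sql
-- Assumption: run in the same session/schema as the setup above.
SET enable_hashjoin = off;       -- steer the planner away from the hash join
EXPLAIN (ANALYZE)
SELECT *
FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t
INNER JOIN t3 ON t3.ref_id = t.id
INNER JOIN links l ON (t.c1 = l.c1);
RESET enable_hashjoin;           -- restore the default for the session
```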
Let's prove it:
CREATE OR REPLACE FUNCTION public.f1()
RETURNS SETOF integer
LANGUAGE plpgsql
ROWS 20000
AS $function$
BEGIN
RETURN QUERY EXECUTE 'SELECT t.c1 FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id';
END;
$function$;
test=> explain select * from f1() t(c1) INNER JOIN links l ON (t.c1 = l.c1);
QUERY PLAN
---------------------------------------------------------------------------
Hash Join (cost=2693.25..3293.25 rows=20000 width=10)
Hash Cond: (t.c1 = l.c1)
-> Function Scan on f1 t (cost=0.25..200.25 rows=20000 width=4)
-> Hash (cost=1443.00..1443.00 rows=100000 width=6)
-> Seq Scan on links l (cost=0.00..1443.00 rows=100000 width=6)
(5 rows)
The same "defective" plan.
test=> ALTER FUNCTION f1() ROWS 4;
ALTER FUNCTION
test=> explain select * from f1() t(c1) INNER JOIN links l ON (t.c1 = l.c1);
QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop (cost=0.54..33.58 rows=4 width=10)
-> Function Scan on f1 t (cost=0.25..0.29 rows=4 width=4)
-> Index Scan using links_pkey on links l (cost=0.29..8.31 rows=1 width=6)
Index Cond: (c1 = t.c1)
(4 rows)
The correct/perfect plan.
In real life I have a bigger "links" table, and the wrong plan slows execution significantly.
I found several workarounds, so it is no longer a problem for me.
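For reference, here is a sketch of one possible workaround of this kind (my assumption; the original post does not say which workarounds were actually used): materialize the small intermediate result, so the planner has accurate row counts before joining the big "links" table.

```sql
-- Materialize the 4-row intermediate result and analyze it, so the
-- subsequent join against "links" is planned with correct statistics.
CREATE TEMP TABLE t_small AS
SELECT t.c1, t.id
FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t
INNER JOIN t3 ON t3.ref_id = t.id;

ANALYZE t_small;  -- now the planner knows t_small holds only a few rows

SELECT * FROM t_small INNER JOIN links l ON (t_small.c1 = l.c1);
```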
I just want to report this "strange thing".
I tried to look into the source code and found some interesting places there, but I think that is useless: the Pg developers know the situation much better than I do.