Optimization idea

Vlad Arkhipov <arhipov@xxxxxxxxxxxx> · Thu, 22 Apr 2010 18:25:46 +0900

Please do this small optimization if it is possible. It seem that the 
optimizer have the all information to create a fast plan but it does not 
do that.

create temp table t1 (id bigint, t bigint);

insert into t1 values (1, 1);
insert into t1 values (2, 2);
insert into t1 values (2, 3);
insert into t1 values (2, 4);
insert into t1 values (3, 5);

create temp table t2 (id bigint, t bigint);

insert into t2 (id, t)
select g, 2
from generate_series(1, 200) g;

insert into t2 (id, t)
select g, 3
from generate_series(201, 300) g;

insert into t2 (id, t)
select g, 4
from generate_series(301, 400) g;

insert into t2 (id, t)
select g, 1
from generate_series(401, 100000) g;

insert into t2 (id, t)
select g, 5
from generate_series(100001, 100100) g;

create index t_idx on t2(t);

analyze t1;
analyze t2;

explain analyze
select *
from t2
 join t1 on t1.t = t2.t
where t1.t = 2

explain analyze
select *
from t2
 join t1 on t1.t = t2.t
where t1.id = 3

explain analyze
select *
from t2
where t2.t in (2, 3, 4)

These two queries are completely equal and optimizator should know it as 
I see from the plans:

"Hash Join  (cost=1.09..2667.09 rows=75000 width=32) (actual 
time=0.026..100.207 rows=400 loops=1)"
"  Hash Cond: (t2.t = t1.t)"
"  ->  Seq Scan on t2  (cost=0.00..1541.00 rows=100000 width=16) (actual 
time=0.007..47.083 rows=100000 loops=1)"
"  ->  Hash  (cost=1.05..1.05 rows=3 width=16) (actual time=0.011..0.011 
rows=3 loops=1)"
"        ->  Seq Scan on t1  (cost=0.00..1.05 rows=3 width=16) (actual 
time=0.005..0.008 rows=3 loops=1)" <-- HERE IS THE PROBLEM. IF THE 
ESTIMATED COUNT = 1 OPTIMIZER BUILDS THE CORRECT FAST PLAN, BUT IF THE 
ESTIMATION IS GREATER THAN 1 WE HAVE A PROBLEM
"              Filter: (id = 2)"
"Total runtime: 100.417 ms"

"Nested Loop  (cost=0.00..1024.46 rows=20020 width=32) (actual 
time=0.030..0.222 rows=100 loops=1)"
"  ->  Seq Scan on t1  (cost=0.00..1.05 rows=1 width=16) (actual 
time=0.008..0.009 rows=1 loops=1)"
"        Filter: (id = 3)"
"  ->  Index Scan using t_idx on t2  (cost=0.00..773.16 rows=20020 
width=16) (actual time=0.016..0.078 rows=100 loops=1)"
"        Index Cond: (t2.t = t1.t)"
"Total runtime: 0.296 ms"

"Bitmap Heap Scan on t2  (cost=16.09..556.80 rows=429 width=16) (actual 
time=0.067..0.256 rows=400 loops=1)"
"  Recheck Cond: (t = ANY ('{2,3,4}'::bigint[]))"
"  ->  Bitmap Index Scan on t_idx  (cost=0.00..15.98 rows=429 width=0) 
(actual time=0.056..0.056 rows=400 loops=1)"
"        Index Cond: (t = ANY ('{2,3,4}'::bigint[]))"
"Total runtime: 0.458 ms"

An ugly workaround is to add the column t1(t) in the table t2.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance