Re: Optimization idea

Cédric Villemain <cedric.villemain.debian@xxxxxxxxx> · Mon, 26 Apr 2010 11:33:29 +0200

2010/4/26 Vlad Arkhipov <arhipov@xxxxxxxxxxxx>:
>
>> On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov <arhipov@xxxxxxxxxxxx>
>> wrote:
>>
>>>
>>> I don't think this is just an issue with statistics, because the same
>>> problem arises when I try executing a query like this:
>>>
>>
>> I'm not sure how you think this proves that it isn't a problem with
>> statistics, but I think what you should be focusing on here, looking
>> back to your original email, is that the plans that are actually much
>> faster have almost as much estimated cost as the slower one.  Since
>> all your data is probably fully cached, at a first cut, I might try
>> setting random_page_cost and seq_page_cost to 0.005 or so, and
>> adjusting effective_cache_size to something appropriate.
>>
>> ...Robert
>>
>>
>
> Ok. I thougth it's quite obvious because of these two queries. I can't
> understand why the estimated rows count is 40040 in the first plan.

In the first query, the planner doesn't use the information of the 2,3,4.
It just does a : I'll bet I'll have 2 rows in t1 (I think it should
say 3, but it doesn't)
So it divide the estimated number of rows in the t2 table by 5
(different values) and multiply by 2 (rows) : 40040.

In the second query the planner use a different behavior : it did
expand the value of t1.t to t2.t for each join relation and find a
costless plan. (than the one using seqscan on t2)

We are here in corner case situation where n_distinc valuest <
statistics on the column and where we might be able to improve the
planner decision. I believe I have already read something on this
topic on -hackers...

>
> test=# explain analyze select * from t2 join t1 on t1.t = t2.t where
> t1.t in (2,3,4);
>                                                   QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=1.09..2319.87 rows=40040 width=32) (actual
> time=0.050..356.269 rows=400 loops=1)
>  Hash Cond: (t2.t = t1.t)
>  ->  Seq Scan on t2  (cost=0.00..1543.00 rows=100100 width=16) (actual
> time=0.013..176.087 rows=100100 loops=1)
>  ->  Hash  (cost=1.07..1.07 rows=2 width=16) (actual time=0.023..0.023
> rows=3 loops=1)
>        Buckets: 1024  Batches: 1  Memory Usage: 1kB
>        ->  Seq Scan on t1  (cost=0.00..1.07 rows=2 width=16) (actual
> time=0.006..0.014 rows=3 loops=1)
>              Filter: (t = ANY ('{2,3,4}'::bigint[]))
> Total runtime: 356.971 ms
> (8 rows)
>
> test=# explain analyze select * from t2 join t1 on t1.t = t2.t where
> t1.t = 2 union all select * from t2 join t1 on t1.t = t2.t where t1.t =
> 3 union all select * from t2 join t1 on t1.t = t2.t where t1.t = 4;
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
> Append  (cost=0.00..112.42 rows=407 width=32) (actual time=0.048..3.487
> rows=400 loops=1)
>  ->  Nested Loop  (cost=0.00..47.51 rows=197 width=32) (actual
> time=0.045..1.061 rows=200 loops=1)
>        ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.011..0.014 rows=1 loops=1)
>              Filter: (t = 2)
>        ->  Index Scan using t_idx on t2  (cost=0.00..44.48 rows=197
> width=16) (actual time=0.026..0.382 rows=200 loops=1)
>              Index Cond: (pg_temp_2.t2.t = 2)
>  ->  Nested Loop  (cost=0.00..32.67 rows=117 width=32) (actual
> time=0.019..0.599 rows=100 loops=1)
>        ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.003..0.006 rows=1 loops=1)
>              Filter: (t = 3)
>        ->  Index Scan using t_idx on t2  (cost=0.00..30.43 rows=117
> width=16) (actual time=0.010..0.211 rows=100 loops=1)
>              Index Cond: (pg_temp_2.t2.t = 3)
>  ->  Nested Loop  (cost=0.00..28.17 rows=93 width=32) (actual
> time=0.017..0.534 rows=100 loops=1)
>        ->  Seq Scan on t1  (cost=0.00..1.06 rows=1 width=16) (actual
> time=0.005..0.008 rows=1 loops=1)
>              Filter: (t = 4)
>        ->  Index Scan using t_idx on t2  (cost=0.00..26.18 rows=93
> width=16) (actual time=0.007..0.187 rows=100 loops=1)
>              Index Cond: (pg_temp_2.t2.t = 4)
> Total runtime: 4.190 ms
> (17 rows)
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

-- 
Cédric Villemain

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance