Re: how do functions affect query plan?

changchao <chang-chao@xxxxxxxxxxx> · Thu, 15 May 2014 15:19:13 +0900

Hi,David

Seems that the root of evil is in the function(random,trunc),
although I don't know why.

Here is the comparison.

1.w/o function : index is wisely used.(Even without the limit 30 clause)

explain analyze
SELECT md.*
  FROM measure_data md
  where telegram_id in 
    (
         SELECT  66484 + (132363-66484)/30 * i 
         FROM generate_series(1,30) as s(i)
         limit 30
    )
  ;

"Nested Loop  (cost=10.01..39290.79 rows=10392 width=28) (actual time=0.079..3.490 rows=9360 loops=1)"
"  ->  HashAggregate  (cost=0.83..1.13 rows=30 width=4) (actual time=0.027..0.032 rows=30 loops=1)"
"        ->  Limit  (cost=0.00..0.45 rows=30 width=4) (actual time=0.013..0.020 rows=30 loops=1)"
"              ->  Function Scan on generate_series s  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.011..0.016 rows=30 loops=1)"
"  ->  Bitmap Heap Scan on measure_data md  (cost=9.19..1306.20 rows=346 width=28) (actual time=0.030..0.075 rows=312 loops=30)"
"        Recheck Cond: (telegram_id = ((66484 + (2195 * s.i))))"
"        ->  Bitmap Index Scan on index_measure_data_telegram_id  (cost=0.00..9.10 rows=346 width=0) (actual time=0.025..0.025 rows=312 loops=30)"
"              Index Cond: (telegram_id = ((66484 + (2195 * s.i))))"
"Total runtime: 3.714 ms"

2.when function is there: seq scan

explain analyze
SELECT md.*
  FROM measure_data md
  where telegram_id in 
    (
         SELECT  trunc((132363-66484) * random()) +66484
         FROM generate_series(1,30) as s(i)
         limit 30
    )
 ;

"Hash Join  (cost=1.65..490288.89 rows=10277280 width=28) (actual time=0.169..4894.847 rows=9360 loops=1)"
"  Hash Cond: ((md.telegram_id)::double precision = ((trunc((65879::double precision * random())) + 66484::double precision)))"
"  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.010..2076.932 rows=20554560 loops=1)"
"  ->  Hash  (cost=1.28..1.28 rows=30 width=8) (actual time=0.041..0.041 rows=30 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
"        ->  HashAggregate  (cost=0.98..1.28 rows=30 width=8) (actual time=0.034..0.036 rows=30 loops=1)"
"              ->  Limit  (cost=0.00..0.60 rows=30 width=0) (actual time=0.016..0.026 rows=30 loops=1)"
"                    ->  Function Scan on generate_series s  (cost=0.00..20.00 rows=1000 width=0) (actual time=0.015..0.023 rows=30 loops=1)"
"Total runtime: 4895.239 ms"

----------------------------------------
> Date: Wed, 14 May 2014 22:43:24 -0700
> From: david.g.johnston@xxxxxxxxx
> To: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re: [PERFORM] how do functions affect query plan?
>
> 常超 wrote
>> Hi,all
>> I have a table to save received measure data.
>>
>>
>> CREATE TABLE measure_data
>> (
>>   id serial NOT NULL,
>>   telegram_id integer NOT NULL,
>>   measure_time timestamp without time zone NOT NULL,
>>   item_id integer NOT NULL,
>>   val double precision,
>>   CONSTRAINT measure_data_pkey PRIMARY KEY (id)
>> );
>>
>> CREATE INDEX index_measure_data_telegram_id ON measure_data USING btree
>> (telegram_id);
>>
>>
>> in my scenario,a telegram contains measure data for multiple data items
>> and timestamps,
>> BTW,another table is for telegram.
>>
>> The SQL I used in my application is
>>   select * from measure_data where telegram_id in(1,2,...,n)
>> and this query used the index_measure_data_telegram_id index,as expected.
>>
>> In order to see the performance of my query ,
>> I used the following query to search the measure data for randomly 30
>> telegrams.
>>
>>
>> explain analyze
>> SELECT md.*
>>   FROM measure_data md
>>   where telegram_id in
>>     (
>>          SELECT distinct
>>          trunc((132363-66484) * random() + 66484)
>>          FROM generate_series(1,30) as s(telegram_id)
>>     )
>>   ;
>>
>> the 132363 and 66484 are the max and min of the telegram id,separately.
>>
>> What surprised me is that index is not used,instead,a seq scan is
>> performed on measure_data.
>> Although,intuitively,in this case,it is much wiser to use the index.
>> Would you please give some clue to why this happened?
>>
>> "Hash Semi Join  (cost=65.00..539169.32 rows=10277280 width=28) (actual
>> time=76.454..17177.054 rows=9360 loops=1)"
>> "  Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double
>> precision * random()) + 66484::double precision))))"
>> "  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560
>> width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
>> "  ->  Hash  (cost=52.50..52.50 rows=1000 width=8) (actual
>> time=0.062..0.062 rows=30 loops=1)"
>> "        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
>> "        ->  HashAggregate  (cost=22.50..42.50 rows=1000 width=0) (actual
>> time=0.048..0.053 rows=30 loops=1)"
>> "              ->  Function Scan on generate_series s  (cost=0.00..20.00
>> rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
>> "Total runtime: 17177.527 ms"
>
> The planner expects to need to return half the table when you provide 1,000
> distinct telegram_ids, which is best handled by scanning the whole table
> sequentially and tossing out invalid data.
>
> I am curious if the plan will be different if you added a LIMIT 30 to the
> sub-query.
>
> The root of the problem is the planner has no way of knowing whether
> generate_series is going to return 1 or 1,000,000 rows so by default it (and
> all functions) are assumed (by the planner) to return 1,000 rows. By adding
> an explicit limit you can better inform the planner as to how many rows you
> are going to be passing up to the parent query and it will hopefully, with
> knowledge of only 30 distinct values, use the index.
>
>
>
>
> --
> View this message in context: http://postgresql.1045698.n5.nabble.com/how-do-functions-affect-query-plan-tp5803993p5803996.html
> Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance