常超 wrote
> Hi, all.
> I have a table to save received measure data.
>
> CREATE TABLE measure_data
> (
>   id serial NOT NULL,
>   telegram_id integer NOT NULL,
>   measure_time timestamp without time zone NOT NULL,
>   item_id integer NOT NULL,
>   val double precision,
>   CONSTRAINT measure_data_pkey PRIMARY KEY (id)
> );
>
> CREATE INDEX index_measure_data_telegram_id ON measure_data
>   USING btree (telegram_id);
>
> In my scenario, a telegram contains measure data for multiple data items
> and timestamps. BTW, there is another table for the telegrams themselves.
>
> The SQL I use in my application is
>
>   select * from measure_data where telegram_id in (1, 2, ..., n)
>
> and this query uses the index_measure_data_telegram_id index, as expected.
>
> To see how my query performs, I used the following statement to fetch the
> measure data for 30 random telegrams:
>
>   explain analyze
>   SELECT md.*
>   FROM measure_data md
>   WHERE telegram_id IN
>   (
>       SELECT DISTINCT
>           trunc((132363 - 66484) * random() + 66484)
>       FROM generate_series(1, 30) AS s(telegram_id)
>   );
>
> 132363 and 66484 are the maximum and minimum telegram ids, respectively.
>
> What surprised me is that the index is not used; instead, a seq scan is
> performed on measure_data, although intuitively it would be much wiser to
> use the index in this case. Could you give me some clue as to why this
> happens?
>
> "Hash Semi Join  (cost=65.00..539169.32 rows=10277280 width=28) (actual time=76.454..17177.054 rows=9360 loops=1)"
> "  Hash Cond: ((md.telegram_id)::double precision = (trunc(((65879::double precision * random()) + 66484::double precision))))"
> "  ->  Seq Scan on measure_data md  (cost=0.00..356682.60 rows=20554560 width=28) (actual time=0.012..13874.809 rows=20554560 loops=1)"
> "  ->  Hash  (cost=52.50..52.50 rows=1000 width=8) (actual time=0.062..0.062 rows=30 loops=1)"
> "        Buckets: 1024  Batches: 1  Memory Usage: 2kB"
> "        ->  HashAggregate  (cost=22.50..42.50 rows=1000 width=0) (actual time=0.048..0.053 rows=30 loops=1)"
> "              ->  Function Scan on generate_series s  (cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.034 rows=30 loops=1)"
> "Total runtime: 17177.527 ms"

The planner expects to return about half the table (rows=10277280 out of
20554560) when given 1,000 distinct telegram_ids, and that is best handled
by scanning the whole table sequentially and throwing away the rows that
don't match. I am curious whether the plan changes if you add a LIMIT 30 to
the sub-query.

The root of the problem is that the planner has no way of knowing whether
generate_series is going to return 1 row or 1,000,000 rows, so by default
it (like all set-returning functions) is assumed to return 1,000 rows. By
adding an explicit LIMIT you tell the planner how many rows the sub-query
will pass up to the parent query, and, knowing it will see at most 30
distinct values, it will hopefully use the index.
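Something like this untested rewrite is what I have in mind. Note the added
::integer cast on the sub-query's output: your plan's Hash Cond shows
telegram_id being cast to double precision to match trunc()'s result, and a
btree index on the integer column cannot be used through such a cast, so
the comparison needs to be brought back to integer = integer as well:

    explain analyze
    SELECT md.*
    FROM measure_data md
    WHERE telegram_id IN
    (
        SELECT DISTINCT
            trunc((132363 - 66484) * random() + 66484)::integer
        FROM generate_series(1, 30) AS s(telegram_id)
        LIMIT 30  -- caps the planner's estimate at 30 rows instead of 1,000
    );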
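As an aside: if the id generation lived in a set-returning function of your
own, you could attach the estimate to the function itself with the ROWS
clause of CREATE FUNCTION, so every caller benefits without needing a LIMIT.
A hypothetical sketch (random_telegram_ids is a made-up name, not anything
in your schema):

    CREATE FUNCTION random_telegram_ids(n integer)
    RETURNS SETOF integer
    LANGUAGE sql
    ROWS 30  -- planner assumes ~30 result rows instead of the default 1,000
    AS $$
        SELECT DISTINCT trunc((132363 - 66484) * random() + 66484)::integer
        FROM generate_series(1, n)
    $$;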