Subject: Re: Random Page Cost and Planner
From: tv@xxxxxxxx
Date: Wed, May 26, 2010 12:01 pm
To: "David Jarvis" <thangalin@xxxxxxxxx>
Priority: Normal

> Hi, Tom.
>
> Yes, that is what happened, making the tests rather meaningless and
> giving me the false impression that the indexes were being used. They
> were, but only because of cached results. When multiple users make
> different queries, the performance will return to ~80s per query.
>
> I also tried Kevin's suggestion, which had no noticeable effect:
>
>   effective_cache_size = 512MB
>
> That said, when using the following condition, the query is fast (1
> second):
>
>   extract(YEAR FROM sc.taken_start) >= 1963 AND
>   extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "  ->  Index Scan using measurement_013_stc_idx on measurement_013 m
>        (cost=0.00..511.00 rows=511 width=15)
>        (actual time=0.018..3.601 rows=3356 loops=104)"
> "        Index Cond: ((m.station_id = sc.station_id) AND
>        (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND
>        (m.category_id = 7))"
>
> This condition makes it slow (13 seconds on first run, 8 seconds
> thereafter):
>
>   extract(YEAR FROM sc.taken_start) >= 1900 AND
>   extract(YEAR FROM sc.taken_end) <= 2009 AND
>
> "        Filter: (category_id = 7)"
> "  ->  Seq Scan on measurement_013 m
>        (cost=0.00..359704.80 rows=18118464 width=15)
>        (actual time=0.008..4025.692 rows=18118395 loops=1)"
>
> At this point, I'm tempted to write a stored procedure that iterates
> over each station category for all the years of each station. My guess
> is that the planner's estimate for the number of rows returned by
> *extract(YEAR FROM sc.taken_start) >= 1900* is incorrect, so it chooses
> a full table scan for all rows. Even though the lower bound appears to
> be the constant value 1900, the average station started collecting data
> 44 years ago (1965) and did so for an average of 21.4 years.
>
> The part I am having trouble with is convincing PG to use the index for
> the station ID and the date range for when the station was active. Each
> station has a unique ID; the data in the measurement table is ordered
> by measurement date, then by station.

Well, don't forget that indexes may not be the best way to evaluate the
query - if the selectivity is low (i.e. the query returns a large portion
of the table), a sequential scan is actually faster. The problem is that
using an index means you have to read the index blocks too, and then the
table blocks, and much of this is random access. So the belief that using
indexes will make the query run faster can be false.

And this is what happens in the queries above - the first query covers
the years 1963-2009, while the second one covers 1900-2009. Given that
this table contains ~40m rows, the first query returns about 0.01% of the
data (3k rows), while the second one returns almost 50% of the data (18m
rows). So I doubt this can be improved using an index ...

But you can try that by setting enable_seqscan=off, or by setting the
random_page_cost / seq_page_cost variables properly (so that the plan
with indexes is cheaper than the sequential scan). You can do that within
the session (e.g. SET enable_seqscan=off) so that you won't harm other
sessions.
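For example (a minimal sketch - the cost value below is an illustrative
assumption, not a recommendation):

  -- Session-local settings; other sessions are unaffected.
  SET enable_seqscan = off;      -- makes the planner treat seq scans as very expensive
  SET random_page_cost = 2.0;    -- default is 4.0; lower values make index plans cheaper

  -- ... re-run the query with EXPLAIN ANALYZE here ...

  RESET enable_seqscan;          -- restore the defaults when finished
  RESET random_page_cost;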
> Should I add a clustered index by station then by date?
>
> Any other suggestions are very much appreciated.

Well, the only thing that crossed my mind is partitioning with properly
defined constraints and constraint_exclusion=on. I'd recommend
partitioning by time (each year a separate partition), but you'll have to
investigate that on your own (it depends on your use cases). A rough
sketch of the idea is at the end of this message.

BTW the effective_cache_size mentioned in the previous posts is just an
'information' parameter - it does not increase the amount of memory
allocated by PostgreSQL. It merely informs PostgreSQL of the expected
size of the disk cache maintained by the OS (Linux), so that PostgreSQL
can estimate the chance that the requested data are already cached (and
won't have to be read from disk).

regards
Tomas
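A minimal sketch of the inheritance-style partitioning suggested above
(the child table measurement_1963 and the parent table measurement are
assumptions based on the plans quoted earlier; adjust the ranges and the
set of partitions to your data):

  -- One child table per year; the CHECK constraint lets the planner
  -- exclude partitions that cannot match the WHERE clause.
  CREATE TABLE measurement_1963 (
      CHECK (taken >= DATE '1963-01-01' AND taken < DATE '1964-01-01')
  ) INHERITS (measurement);

  -- An index like the measurement_013_stc_idx seen in the plans above
  -- (column order assumed).
  CREATE INDEX measurement_1963_stc_idx
      ON measurement_1963 (station_id, taken, category_id);

  -- Constraint exclusion must be enabled for the planner to skip
  -- non-matching partitions.
  SET constraint_exclusion = on;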