On Wed, Apr 27, 2011 at 5:19 PM, Sok Ann Yap <sokann@xxxxxxxxx> wrote:
On Thu, Apr 28, 2011 at 7:23 AM, Kevin Grittner
I understand the need to tune PostgreSQL properly for my use case.
What I am curious about is, for the data set I have, under what
circumstances (hardware/workload/cache status/etc) would a sequential
scan really be faster than an index scan for that particular query?
Possibly none on your hardware - if the index is likely to be in memory along with the actual table rows. In which case, the cost for index scan (random page cost) should be made much closer to the cost for sequential access. It looks like the planner must use the same strategy on each iteration of the loop - it can't do index scan for some values and sequential scan for others, so it must be computing the cost as sequential_cost * (number of entries(44)) versus random_cost * (number of entries). If random page cost is unreasonably high, it's not hard to see how it could wind up looking more expensive to the planner, causing it to choose the sequential scan for each loop iteration. If it were able to change strategy on each iteration, it would be able to accurately assess cost for each iteration and choose the correct strategy for that value. As soon as you set the costs closer to actual cost for your system, postgres does make the correct choice. If there weren't enough memory that postgres could be 'sure' that the index would remain in cache at least for the duration of all 44 iterations due to high workload, it is easy to see how the index scan might become significantly more expensive than the sequential scan, since the index scan must also load the referenced page from the table - postgres cannot get values directly from the index.
To simulate a scenario when nothing is cached, I stopped PostgreSQL,
dropped all system cache (sync; echo 3 > /proc/sys/vm/drop_caches),
restarted PostgreSQL, and ran the query. A sequential scan run took
13.70 seconds, while an index scan run took 0.34 seconds, which is
still 40 times faster.
Also, I tried increasing effective_cache_size from 512MB to 3GB (the
database size is 2+GB), and it still favor sequential scan. The
estimated costs did not change at all.
Greg Smith had this to say in a another thread on this same subject:
effective_cache_size probably doesn't do as much as you suspect. It is used for one of the computations for whether an index is small enough that it can likely be read into memory efficiently. It has no impact on caching decisions outside of that.
This is why the cost for random page access must be fairly accurate.Even if the index is in memory, it still needs to access the page of data in the table referenced by the index, which is why the cost of random access must be accurate. That cost is a factor of both the performance of your storage infrastructure and the cache hit rate and can't really be computed by the database on the fly. You seem to be looking at the data which exposes the fact that random page access is fast and wondering why postgres isn't doing the right thing when postgres isn't doing the right thing precisely because it doesn't know that random page access is fast. Since you don't have particularly fast storage infrastructure, this is likely a function of cache hit rate, so you must factor in eventual load on the db when setting this value. While it may be fast in a lightly loaded test environment, those random page accesses will get much more expensive when competing with other concurrent disk access.
There's another thread currently active on this list (it started on April 12) with subject "Performance" which contains this explanation of what is going on and why you need to tune these parameters independently of effective_cache_size:
When the planner decides what execution plan to use,it computes a 'virtual cost' for different plans and then chooses thecheapest one.
Decreasing 'random_page_cost' decreases the expected cost of plansinvolving index scans, so that at a certain point it seems cheaper thana plan using sequential scans etc.You can see this when using EXPLAIN - do it with the original costvalues, then change the values (for that session only) and do theEXPLAIN only. You'll see how the execution plan suddenly changes andstarts to use index scans.The problem with random I/O is that it's usually much more expensivethan sequential I/O as the drives need to seek etc. The only case whenrandom I/O is just as cheap as sequential I/O is when all the data iscached in memory, because within RAM there's no difference betweenrandom and sequential access (right, that's why it's called RandomAccess Memory).
So in the previous post setting both random_page_cost and seq_page_costto the same value makes sense, because when the whole database fits intothe memory, there's no difference and index scans are favorable.In this case (the database is much bigger than the available RAM) thisno longer holds - index scans hit the drives, resulting in a lot ofseeks etc. So it's a serious performance killer ...
Note: I'm not a postgres developer, so I don't often contribute to these threads for fear of communicating misinformation. I'm sure someone will speak up if I got it wrong, but I happened to read that other thread this afternoon and I wasn't sure anyone else would bring it up, so I chimed in. Take my input with a grain of salt until confirmed by someone with more knowledge of postgres internals than I possess.
--sam