"Tom Lane" <tgl@xxxxxxxxxxxxx> writes: > "Joshua D. Drake" <jd@xxxxxxxxxxxxxxxxx> writes: > >> Exactly. Which is the point I am making. Five minutes of transactions >> is nothing (speaking generally).. In short, if we are in recovery, and >> we are not saturated the I/O and at least a single CPU, there is a huge >> amount of optimization *somewhere* to be done. > > You sure about that? I tested CVS HEAD just now, by setting the > checkpoint_ parameters really high, running pgbench for awhile, and > then killing the bgwriter to force a recovery cycle over all the WAL > generated by the pgbench run. What I saw was that the machine was 100% > disk write bound. Increasing shared_buffers helped, not in that the > write rate got less according to vmstat, but the completion time did. There are at least three definitions of "saturating the I/O" and it sounds like you two are using two different ones. 1) The processor is waiting on I/O all the time 2) The hard drives are all always handling a request 3) The hard drives are all handling the full bandwidth they're capable You would expect (1) and (2) to be the same for a single drive -- though in practice there seems to be a gap even between them. But for a raid array there can be a large difference, and the wider the raid stripe the larger the difference. In Tom's results: > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0 > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0 > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0 > > I don't see the machine sitting around doing nothing ... Note that even though the processor is 99% in wait state the drive is only handling about 3 MB/s. That translates into a seek time of 2.2ms which is actually pretty fast. So if this is a single drive (1) and (2) seem to be pretty much the same here. But note that if this were a raid array Postgres's wouldn't be getting any better results. A Raid array wouldn't improve i/o latency at all and since it's already 99% waiting for i/o Postgres is not going to be able to issue any more. But only one drive in the raid array will be busy at a time which would be far less than the maximum random access i/o the raid array is capable of. Heikki proposed a while back to use posix_fadvise() when processing logs to read-ahead blocks which the recover will need before actually attempting to recover them. On a raid array that would bring the 3MB/s above up to the maximum number of random accesses the raid array can handle (ie, definition (2) above). That's still going to be a far cry from the maximum bandwidth the hard drive can handle. Even that single drive can probably handle 60MB/s sequential I/O. That's probably the source of the unrealistic expectations people have. It's easy to watch the bandwidth number as the headline number to measure i/o utilization. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support! ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq