"Tom Lane" <tgl@xxxxxxxxxxxxx> writes: > "Joshua D. Drake" <jd@xxxxxxxxxxxxxxxxx> writes: > >> Exactly. Which is the point I am making. Five minutes of transactions >> is nothing (speaking generally).. In short, if we are in recovery, and >> we are not saturated the I/O and at least a single CPU, there is a huge >> amount of optimization *somewhere* to be done. > > You sure about that? I tested CVS HEAD just now, by setting the > checkpoint_ parameters really high, running pgbench for awhile, and > then killing the bgwriter to force a recovery cycle over all the WAL > generated by the pgbench run. What I saw was that the machine was 100% > disk write bound. Increasing shared_buffers helped, not in that the > write rate got less according to vmstat, but the completion time did. There are at least three definitions of "saturating the I/O" and it sounds like you two are using two different ones. 1) The processor is waiting on I/O all the time 2) The hard drives are all always handling a request 3) The hard drives are all handling the full bandwidth they're capable You would expect (1) and (2) to be the same for a single drive -- though in practice there seems to be a gap even between them. But for a raid array there can be a large difference, and the wider the raid stripe the larger the difference. In Tom's results: > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0 > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0 > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0 > > I don't see the machine sitting around doing nothing ... Note that even though the processor is 99% in wait state the drive is only handling about 3 MB/s. That translates into a seek time of 2.2ms which is actually pretty fast. So if this is a single drive (1) and (2) seem to be pretty much the same here. But note that if this were a raid array Postgres's wouldn't be getting any better results. A Raid array wouldn't improve i/o latency at all and since it's already 99% waiting for i/o Postgres is not going to be able to issue any more. But only one drive in the raid array will be busy at a time which would be far less than the maximum random access i/o the raid array is capable of. Heikki proposed a while back to use posix_fadvise() when processing logs to read-ahead blocks which the recover will need before actually attempting to recover them. On a raid array that would bring the 3MB/s above up to the maximum number of random accesses the raid array can handle (ie, definition (2) above). That's still going to be a far cry from the maximum bandwidth the hard drive can handle. Even that single drive can probably handle 60MB/s sequential I/O. That's probably the source of the unrealistic expectations people have. It's easy to watch the bandwidth number as the headline number to measure i/o utilization. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support! ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq