On Sat, 26 Sep 2009, Jeff Janes wrote:
> On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
>> Another problem spot is checkpoints.  If you dirty a very large buffer
>> cache, that whole thing will have to get dumped to disk eventually, and
>> on some workloads people have found they have to reduce shared_buffers
>> specifically to keep this from being too painful.
> Is this the case even if checkpoint_completion_target is set close to 1.0?
Sure.  checkpoint_completion_target spreads each checkpoint's writes over
more of the interval until the next checkpoint is due, but it alone doesn't
change the fact that the interval between checkpoints is only so long.  By
default, you're going to get one every five minutes, and on active systems
they can come every few seconds if you're not aggressive about increasing
checkpoint_segments.
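To put rough numbers on that, here's a back-of-the-envelope sketch (in
Python, with made-up WAL rates) of how often xlog-driven checkpoints get
requested; it assumes the usual 16MB WAL segment size and that a checkpoint
starts once checkpoint_segments segments have filled since the last one:

# Rough estimate of how often WAL-driven checkpoints are requested.
WAL_SEGMENT_MB = 16.0

def checkpoint_interval_sec(checkpoint_segments, wal_mb_per_sec,
                            checkpoint_timeout_sec=300):
    """Seconds between checkpoints, whichever limit is hit first."""
    xlog_driven = checkpoint_segments * WAL_SEGMENT_MB / wal_mb_per_sec
    return min(xlog_driven, checkpoint_timeout_sec)

# Default checkpoint_segments=3 on a system writing 10MB/s of WAL:
print(checkpoint_interval_sec(3, 10))    # ~4.8 seconds between checkpoints
# Raising checkpoint_segments to 64 stretches that out considerably:
print(checkpoint_interval_sec(64, 10))   # ~102 seconds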
Some quick math gives an idea of the scale of the problem.  A single cheap
disk can write random I/O (which checkpoint writes often are) at 1-2MB/s;
let's call it 100MB/minute.  That means that in 5 minutes, a single-disk
system might be hard-pressed to write even 500MB of data out.  But you can
easily dirty 500MB in seconds nowadays. Now imagine shared_buffers is
40GB and you've dirtied a bunch of it; how long will that take to clear
even on a fast RAID system? It won't be quick, and the whole system will
grind to a halt at the end of the checkpoint as all the buffered writes
queued up are forced out.
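To work through that math explicitly (a rough sketch in Python; the 40GB
figure and the disk rates are just the examples above):

# Back-of-the-envelope flush time for a pile of dirty buffers.
def flush_minutes(dirty_gb, mb_per_sec):
    """Minutes to push dirty_gb of mostly-random writes at mb_per_sec."""
    return dirty_gb * 1024.0 / mb_per_sec / 60

print(flush_minutes(0.5, 1.7))   # ~5 minutes: one cheap disk, 500MB dirty
print(flush_minutes(40, 1.7))    # ~400 minutes: same disk, 40GB dirty
print(flush_minutes(40, 20))     # ~34 minutes even at 20MB/s of random I/O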
> If you dirty buffers fast enough to dirty most of a huge shared_buffers
> area between checkpoints, then it seems like lowering the shared_buffers
> wouldn't reduce the amount of I/O needed, it would just shift the I/O
> from checkpoints to the backends themselves.
What's even worse is that backends can be writing data and filling the OS
buffer cache in between checkpoints as well, and all of that is also forced
to complete before the checkpoint can finish.  You can easily start a
checkpoint with the OS cache already filled with backend writes that will
slow down the checkpoint's own writes if you're not careful.
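You can extend the same arithmetic to include that backlog; the backlog
size here is just made up for illustration:

# The checkpoint's fsync calls can't finish until everything already
# queued in the OS write cache goes out too.
def checkpoint_stall_minutes(checkpoint_mb, os_cache_backlog_mb, mb_per_sec):
    return (checkpoint_mb + os_cache_backlog_mb) / float(mb_per_sec) / 60

# 500MB of checkpoint writes plus 2GB of backend writes already sitting
# in the OS cache, flushed at 5MB/s of random I/O:
print(checkpoint_stall_minutes(500, 2048, 5))   # ~8.5 minutes of heavy I/O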
Because disks are slow, you want writes handed to the OS as soon as
feasible, so it has more time to work on them, reorder them for efficient
writing, etc.
Ultimately, the sooner you get I/O to the OS cache to write, the better,
*unless* you're going to write that same block over again before it must
go to disk.  Normally you want buffers that aren't accessed often to get
written out to disk early rather than linger until checkpoint time;
there's nothing wrong with a backend doing a write if that block wasn't
going to be used again soon.  The ideal setup from a latency perspective
is to size shared_buffers just large enough to hold the things you write
to regularly, but not so big that it caches every write.
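As a rough way to think about that sizing (the hot write set is something
you'd have to measure yourself; this is just a sketch of the heuristic, not
a formula from the documentation):

# Latency-oriented cap on shared_buffers: big enough to hold the blocks
# you rewrite often, small enough that a checkpoint's worth of dirty
# buffers can still be flushed within one checkpoint interval.
def shared_buffers_cap_mb(hot_write_set_mb, checkpoint_interval_min,
                          random_write_mb_per_min):
    flushable_mb = checkpoint_interval_min * random_write_mb_per_min
    return min(hot_write_set_mb, flushable_mb)

# ~1GB of frequently rewritten blocks, 5-minute checkpoints, and a RAID
# array sustaining ~400MB/minute of random writes:
print(shared_buffers_cap_mb(1024, 5, 400))   # 1024: the hot set fits
print(shared_buffers_cap_mb(8192, 5, 400))   # 2000: capped by flush rate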
> It looks like checkpoint_completion_target was introduced in 8.3.0.
Correct.  Before then, you had no way to reduce checkpoint overhead other
than using very small settings for shared_buffers, particularly if you
cranked the old background writer up so that it wrote lots of redundant
information too (that was the main result of "tuning" it on versions
before 8.3 as well).
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD