On Sat, 26 Sep 2009, Jeff Janes wrote:
> On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
>> Another problem spot is checkpoints.  If you dirty a very large buffer
>> cache, that whole thing will have to get dumped to disk eventually, and
>> on some workloads people have found they have to reduce shared_buffers
>> specifically to keep this from being too painful.
> Is this the case even if checkpoint_completion_target is set close to 1.0?
Sure.  checkpoint_completion_target spreads each checkpoint's writes over
more of the interval until the next checkpoint is due, but it alone doesn't
change the fact that the interval between checkpoints is only so long.  By
default, you're going to get one every five minutes, and on active systems
they can come every few seconds if you're not aggressive about increasing
checkpoint_segments.
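To put rough numbers on that, here's a back-of-the-envelope sketch (in
Python, with made-up WAL rates) of how often xlog-driven checkpoints get
requested; it assumes the usual 16MB WAL segment size and that a checkpoint
starts once checkpoint_segments segments have filled since the last one:

# Rough estimate of how often WAL-driven checkpoints are requested.
WAL_SEGMENT_MB = 16.0

def checkpoint_interval_sec(checkpoint_segments, wal_mb_per_sec,
                            checkpoint_timeout_sec=300):
    """Seconds between checkpoints, whichever limit is hit first."""
    xlog_driven = checkpoint_segments * WAL_SEGMENT_MB / wal_mb_per_sec
    return min(xlog_driven, checkpoint_timeout_sec)

# Default checkpoint_segments=3 on a system writing 10MB/s of WAL:
print(checkpoint_interval_sec(3, 10))    # ~4.8 seconds between checkpoints
# Raising checkpoint_segments to 64 stretches that out considerably:
print(checkpoint_interval_sec(64, 10))   # ~102 seconds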
Some quick math gives an idea of the scale of the problem.  A single cheap
disk can write random I/O (which checkpoint writes often are) at 1-2MB/s;
let's call it 100MB/minute.  That means that in 5 minutes, a single-disk
system might be hard-pressed to write even 500MB of data out.  But you can
easily dirty 500MB in seconds nowadays. Now imagine shared_buffers is
40GB and you've dirtied a bunch of it; how long will that take to clear
even on a fast RAID system? It won't be quick, and the whole system will
grind to a halt at the end of the checkpoint as all the buffered writes
queued up are forced out.
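To work through that math explicitly (a rough sketch in Python; the 40GB
figure and the disk rates are just the examples above):

# Back-of-the-envelope flush time for a pile of dirty buffers.
def flush_minutes(dirty_gb, mb_per_sec):
    """Minutes to push dirty_gb of mostly-random writes at mb_per_sec."""
    return dirty_gb * 1024.0 / mb_per_sec / 60

print(flush_minutes(0.5, 1.7))   # ~5 minutes: one cheap disk, 500MB dirty
print(flush_minutes(40, 1.7))    # ~400 minutes: same disk, 40GB dirty
print(flush_minutes(40, 20))     # ~34 minutes even at 20MB/s of random I/O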
> If you dirty buffers fast enough to dirty most of a huge shared_buffers
> area between checkpoints, then it seems like lowering the shared_buffers
> wouldn't reduce the amount of I/O needed, it would just shift the I/O
> from checkpoints to the backends themselves.
What's even worse is that backends can be writing data and filling the OS
buffer cache in between checkpoints as well, and all of that is also forced
to complete before the checkpoint can finish.  You can easily start a
checkpoint with the OS cache already filled with backend writes that will
slow down the checkpoint's own writes if you're not careful.
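You can extend the same arithmetic to include that backlog; the backlog
size here is just made up for illustration:

# The checkpoint's fsync calls can't finish until everything already
# queued in the OS write cache goes out too.
def checkpoint_stall_minutes(checkpoint_mb, os_cache_backlog_mb, mb_per_sec):
    return (checkpoint_mb + os_cache_backlog_mb) / float(mb_per_sec) / 60

# 500MB of checkpoint writes plus 2GB of backend writes already sitting
# in the OS cache, flushed at 5MB/s of random I/O:
print(checkpoint_stall_minutes(500, 2048, 5))   # ~8.5 minutes of heavy I/O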
Because disks are slow, you want writes handed to the OS as soon as
feasible, so it has more time to work on them, reorder them for efficient
writing, etc.
Ultimately, the sooner you get I/O to the OS cache to write, the better,
*unless* you're going to write that same block over again before it must
go to disk.  Normally you want buffers that aren't accessed often to get
written out to disk early rather than linger until checkpoint time;
there's nothing wrong with a backend doing a write if that block wasn't
going to be used again soon.  The ideal setup from a latency perspective
is to size shared_buffers just large enough to hold the things you write
to regularly, but not so big that it caches every write.
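As a rough way to think about that sizing (the hot write set is something
you'd have to measure yourself; this is just a sketch of the heuristic, not
a formula from the documentation):

# Latency-oriented cap on shared_buffers: big enough to hold the blocks
# you rewrite often, small enough that a checkpoint's worth of dirty
# buffers can still be flushed within one checkpoint interval.
def shared_buffers_cap_mb(hot_write_set_mb, checkpoint_interval_min,
                          random_write_mb_per_min):
    flushable_mb = checkpoint_interval_min * random_write_mb_per_min
    return min(hot_write_set_mb, flushable_mb)

# ~1GB of frequently rewritten blocks, 5-minute checkpoints, and a RAID
# array sustaining ~400MB/minute of random writes:
print(shared_buffers_cap_mb(1024, 5, 400))   # 1024: the hot set fits
print(shared_buffers_cap_mb(8192, 5, 400))   # 2000: capped by flush rate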
> It looks like checkpoint_completion_target was introduced in 8.3.0.
Correct.  Before then, you had no way to reduce checkpoint overhead other
than using very small settings for shared_buffers, particularly if you
cranked the old background writer up so that it wrote lots of redundant
information too (that was the main result of "tuning" it on versions
before 8.3 as well).
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD