
Re: Limit of bgwriter_lru_maxpages of max. 1000?


On Fri, 2 Oct 2009, Scott Marlowe wrote:

> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?

Generally, but there are plenty of ways you can get into a state where a short but not immediate checkpoint is better. For example, consider a case where your buffer cache is filled with really random stuff. There's a sorting horizon in effect, where your OS and/or controller makes decisions about what order to write things based on the data it already has around, not really knowing what's coming in the near future.

Let's say you've got 256MB of cache in the disk controller, you have 1GB of buffer cache to write out, and there's 8GB of RAM in the server so it can cache the whole write. If you wrote it out in a big burst, the OS would elevator sort things and feed them to the controller in disk order. Very efficient, one pass over the disk to write everything out.

But if you broke that up into 256MB write pieces on the database side instead, pausing after each chunk was written, the OS would only be sorting across 256MB at a time, and would basically fill the controller cache with that before it saw the larger picture. The disk controller can then end up making seek decisions within that small a planning window that are not really optimal, making more passes over the disk to write the same data out. If the timing between the DB write cache and the OS is pathologically out of sync here, the result can end up being slower than if you had just written out bigger chunks in the first place. This is one reason I'd like to see fsync calls happen earlier and more evenly than they do now, to reduce these edge cases.
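The effect is easy to see in a toy model. This sketch (my own illustration, not anything from PostgreSQL; the `sweeps` function and all numbers are made up for the example) counts how many elevator sweeps a disk head would make when the same scattered blocks are sorted globally versus in small isolated chunks:

```python
import random

def sweeps(order):
    """Count ascending sweeps the head makes writing blocks in this order.
    Each time the next block is behind the head, a new sweep starts."""
    count = 1
    for prev, cur in zip(order, order[1:]):
        if cur < prev:
            count += 1
    return count

random.seed(42)
blocks = random.sample(range(100_000), 4096)   # scattered dirty blocks

# OS sees the whole burst: one big elevator sort, one pass over the disk.
full_sort = sorted(blocks)

# OS only ever sees 256 blocks at a time: each chunk is sorted in
# isolation, so every chunk boundary forces a seek back to the start.
chunked = []
for i in range(0, len(blocks), 256):
    chunked.extend(sorted(blocks[i:i + 256]))

print(sweeps(full_sort))   # 1
print(sweeps(chunked))     # roughly one sweep per chunk
```

The global sort makes a single pass; the chunked version makes about one pass per chunk, which is the "small planning window" penalty described above.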

The usual approach I take in this situation is to reduce the amount of write caching the OS does, so at least things get more predictable. A giant write cache always gives the best average performance, but the worst-case behavior increases at the same time.
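On Linux, for example, that usually means shrinking the kernel's dirty-page cache via the vm.* sysctls. The knobs are real, but the values below are purely illustrative, not recommendations:

```shell
# /etc/sysctl.conf fragment -- shrink the kernel write-back cache so
# flushes start sooner and stay smaller (values illustrative only)
vm.dirty_background_ratio = 1   # kick off background writeback at 1% of RAM dirty
vm.dirty_ratio = 5              # block writers outright at 5% of RAM dirty
```

Lower values trade some peak throughput for flushes that are smaller and arrive on a more predictable schedule.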

There was a patch floating around at one point that sorted all the checkpoint writes by block order, which would reduce how likely it is you'll end up in one of these odd cases. The benefit turned out to be hard to nail down, though, because in the typical case the OS caching here trumps any I/O scheduling you try to do in user land, and it's hard to repeatably generate scattered data in a benchmark situation.
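The idea behind that patch can be sketched in a few lines. This is my own illustration of the technique, not PostgreSQL's actual code; the `DirtyBuffer` type and field names are invented for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DirtyBuffer:
    """Stand-in for a dirty shared buffer (names are illustrative)."""
    tablespace: int
    relation: int
    block: int        # block number within the relation file

def checkpoint_order(buffers):
    """Sort dirty buffers by physical location, so the writes reach the
    kernel already in roughly disk order instead of buffer-pool order."""
    return sorted(buffers, key=lambda b: (b.tablespace, b.relation, b.block))

dirty = [
    DirtyBuffer(1, 16384, 7),
    DirtyBuffer(1, 16385, 0),
    DirtyBuffer(1, 16384, 2),
]
for b in checkpoint_order(dirty):
    print(b.relation, b.block)
```

Handing the kernel pre-sorted writes makes its elevator's job trivial, but as noted above, the OS usually re-sorts anyway, which is why the win was hard to measure.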

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
