On Fri, 2 Oct 2009, Greg Smith wrote:
On Fri, 2 Oct 2009, Scott Marlowe wrote:
I found that lowering checkpoint completion target was what helped.
Does that seem counter-intuitive to you?
I set it to 0.0 now.
Generally, but there are plenty of ways you can get into a state where a
short but not immediate checkpoint is better. For example, consider a case
where your buffer cache is filled with really random stuff. There's a
sorting horizon in effect, where your OS and/or controller makes decisions
about what order to write things based on the data it already has around, not
really knowing what's coming in the near future.
OK, if the checkpoint doesn't block anything during normal operation, then its duration doesn't really matter.
Let's say you've got 256MB of cache in the disk controller, you have 1GB of
buffer cache to write out, and there's 8GB of RAM in the server so it can
cache the whole write. If you wrote it out in a big burst, the OS would
elevator sort things and feed them to the controller in disk order. Very
efficient, one pass over the disk to write everything out.
But if you broke that up into 256MB write pieces instead on the database
side, pausing after each chunk was written, the OS would only be sorting
across 256MB at a time, and would basically fill the controller cache up with
that before it saw the larger picture. The disk controller can then end up making seek decisions within that small planning window that are not really optimal, making more passes over the disk to write the same data out. If the timing between the DB write cache and the OS is pathologically out of sync here, the result can end up slower than if you had just written bigger chunks in the first place. This is one reason I'd like to see fsync calls happen earlier and more evenly than they do now, to reduce these edge cases.
The usual approach I take in this situation is to reduce the amount of write
caching the OS does, so at least things get more predictable. A giant write
cache always gives the best average performance, but the worst-case behavior
increases at the same time.
There was a patch floating around at one point that sorted all the checkpoint
writes by block order, which would reduce how likely it is you'll end up in
one of these odd cases. That turned out to be hard to nail down the benefit of, though, because in a typical case the OS caching trumps any I/O scheduling you try to do in userland, and it's hard to repeatably generate scattered data in a benchmark situation.
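To convince myself of that sorting-horizon effect, I put together a toy simulation (plain Python, nothing PostgreSQL-specific; the block counts and the 256-block window are made-up numbers standing in for the cache sizes above):

```python
import random

def write_passes(blocks, sort_window):
    """Simulate an elevator: writes are sorted only within each chunk
    of `sort_window` blocks; every backward jump of the disk head
    starts another pass over the disk."""
    order = []
    for i in range(0, len(blocks), sort_window):
        order.extend(sorted(blocks[i:i + sort_window]))
    passes = 1
    for prev, cur in zip(order, order[1:]):
        if cur < prev:
            passes += 1
    return passes

random.seed(1)
dirty = random.sample(range(100_000), 4096)  # scattered dirty blocks

one_burst = write_passes(dirty, len(dirty))  # OS sorts the whole burst
chunked = write_passes(dirty, 256)           # sorting horizon: 256 blocks
print(one_burst, chunked)                    # 1 pass vs. many passes
```

The big burst collapses to a single sorted sweep, while the chunked writes force the head back to the start of the disk at nearly every chunk boundary.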
OK, with a basic insert test and a SystemTap script
(http://www.wiesinger.com/opensource/systemtap/postgresql-checkpoint.stp),
the checkpoint still shows up as a major I/O spike:
################################################################################
Buffers between : Sun Oct 4 18:29:50 2009, synced 55855 buffer(s), flushed 744 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:29:50 2009
Checkpoint end : Sun Oct 4 18:29:56 2009, synced 12031 buffer(s), flushed 12031 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:20 2009, synced 79000 buffer(s), flushed 0 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:20 2009
Checkpoint end : Sun Oct 4 18:30:26 2009, synced 10753 buffer(s), flushed 10753 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:50 2009, synced 51120 buffer(s), flushed 1007 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:50 2009
Checkpoint end : Sun Oct 4 18:30:56 2009, synced 11899 buffer(s), flushed 11912 buffer(s)
################################################################################
OK, I had a further look at the code to understand the behavior of the
buffer cache and the background writer, since the observed behavior didn't
seem logical to me.
As far as I can see, the basic algorithm is:
1.) Normally (outside checkpoints), only dirty and non-recently-used pages
(usage_count == 0) are flushed to disk. I think that's basically fine as a
strategy, since indexes might update blocks more than once. It's also OK that
blocks are written but not flushed (that will be done at checkpoint time).
2.) At checkpoints, write out all dirty buffers and flush everything
written, both previously and newly. Spreading the I/O also seems OK to me now.
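As a sanity check of my understanding, a minimal sketch of that two-part strategy (toy Python, not the actual C code; the aging decrement in the bgwriter pass is my assumption about how the clock sweep ought to behave):

```python
from dataclasses import dataclass

@dataclass
class Buf:
    dirty: bool
    usage_count: int

def bgwriter_pass(pool):
    """1.) Between checkpoints: write only dirty buffers that are not
    recently used (usage_count == 0); age the others instead of
    writing them (the aging step is my assumption)."""
    written = []
    for i, b in enumerate(pool):
        if b.usage_count > 0:
            b.usage_count -= 1   # skip now, but age it for a later pass
        elif b.dirty:
            b.dirty = False      # write() to the OS only, fsync deferred
            written.append(i)
    return written

def checkpoint_pass(pool):
    """2.) At checkpoint: write every dirty buffer, recently used or
    not; the fsync of everything written happens afterwards."""
    written = [i for i, b in enumerate(pool) if b.dirty]
    for b in pool:
        b.dirty = False
    return written

pool = [Buf(True, 0), Buf(True, 2), Buf(False, 0)]
print(bgwriter_pass(pool))    # [0] -- the hot dirty buffer is skipped
print(checkpoint_pass(pool))  # [1] -- the checkpoint writes it anyway
```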
BUT: I think I've found 2 major bugs in the implementation (or I didn't
understand something correctly). The codebase analyzed is 8.3.8, since
that's what I currently use.
##############################################
Bug1: usage_count is IMHO not consistent
##############################################
I think this has been introduced with:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blobdiff;f=src/backend/storage/buffer/bufmgr.c;h=6e6b862273afea40241e410e18fd5d740c2b1643;hp=97f7822077de683989a064cdc624a025f85e54ab;hb=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0;hpb=98ffa4e9bd75c8124378c712933bb13d2697b694
So either the usage_count = 1 initialization in BufferAlloc() is not correct,
or SyncOneBuffer() with skip_recently_used and usage_count = 1 is not correct:
if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
    result |= BUF_REUSABLE;
else if (skip_recently_used)
{
    /* Caller told us not to write recently-used buffers */
    UnlockBufHdr(bufHdr);
    return result;
}
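A toy model of what I mean (plain Python, just mirroring the conditional above with refcount assumed 0; not PostgreSQL code):

```python
def sync_one_buffer(usage_count, dirty, skip_recently_used):
    """Toy mirror of the excerpt above: recently used buffers are
    skipped when the caller asks for it."""
    if usage_count == 0:
        return "written" if dirty else "clean"
    if skip_recently_used:
        return "skipped"          # the bgwriter path gives up here
    return "written" if dirty else "clean"

# BufferAlloc() initializes usage_count = 1, so on the bgwriter path
# (skip_recently_used = True) a freshly loaded dirty buffer is skipped
# on every pass unless something else drops the count back to 0:
print(sync_one_buffer(1, dirty=True, skip_recently_used=True))   # skipped
# The checkpoint path (skip_recently_used = False) writes it anyway:
print(sync_one_buffer(1, dirty=True, skip_recently_used=False))  # written
```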
##############################################
Bug2: Double iteration of buffers
##############################################
As you can see in the call tree below, the buffers are iterated over twice.
This might be a major performance bottleneck.
// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=false
        CheckpointWriteDelay() // Bug here? BgBufferSync() is called, which iterates over the buffers again!

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }

BgBufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g. indexes). But an improvement (e.g. aging) is IMHO necessary
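A rough back-of-the-envelope of the extra scanning this nesting could imply in the worst case, assuming each delay call re-walks the whole pool (all numbers below are illustrative guesses, not measurements):

```python
# Shared-buffer scans caused by one checkpoint, if every
# CheckpointWriteDelay() call runs BgBufferSync() and that walks the
# whole pool of buffer headers again.
shared_buffers = 16_384        # pool size (128 MB at 8 kB pages), a guess
dirty_at_checkpoint = 12_000   # roughly matches the log output above
writes_per_delay = 100         # hypothetical writes between two delays

delays = dirty_at_checkpoint // writes_per_delay  # 120 delay calls
extra_visits = delays * shared_buffers            # one re-scan per delay
print(extra_visits)  # 1_966_080 extra buffer-header visits
```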
##############################################
BTW: Are there any tests available showing how fast a buffer cache hit is
versus a disk cache hit (not in the buffer cache, but in the OS/disk cache)?
I ask because a lot of locking is involved in the code.
BTW2: Oracle's buffer cache and background writer strategy is also
interesting:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/memory.htm#i10221
Thanks for any feedback.
Ciao,
Gerhard
--
http://www.wiesinger.com/
-----------------------------------
src/backend/postmaster/bgwriter.c
-----------------------------------
BackgroundWriterMain()
    loop forever:
        timeout:
            CreateCheckPoint() // NON_IMMEDIATE
            smgrcloseall()
        nontimeout:
            BgBufferSync()
        sleep
        // Rest is done in XLogWrite()

RequestCheckpoint()
    CreateCheckPoint() or signal through shared memory segment
    smgrcloseall()

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }
-----------------------------------
src/backend/commands/dbcommands.c
-----------------------------------
createdb()
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
dropdb()
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/commands/tablespace.c
-----------------------------------
DropTableSpace()
    RequestCheckpoint()
-----------------------------------
src/backend/tcop/utility.c
-----------------------------------
ProcessUtility()
    // Command CHECKPOINT;
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/access/transam/xlog.c
-----------------------------------
CreateCheckPoint()
    CheckPointGuts()
        CheckPointCLOG()
        CheckPointSUBTRANS()
        CheckPointMultiXact()
        CheckPointBuffers() /* performs all required fsyncs */
        CheckPointTwoPhase()

XLogWrite()
    too_much_transaction_log_consumed:
        RequestCheckpoint() // NON_IMMEDIATE

pg_start_backup()
    RequestCheckpoint() // CHECKPOINT_FORCE | CHECKPOINT_WAIT

XLogFlush()
    // Flush transaction log
-----------------------------------
src/backend/storage/buffer/bufmgr.c
-----------------------------------
CheckPointBuffers()
    BufferSync()
    smgrsync()

// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=false
        CheckpointWriteDelay() // Bug here? BgBufferSync() is called, which iterates over the buffers again!

// Background writer buffer sync
BgBufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g. indexes). But an improvement (e.g. aging) is IMHO necessary

SyncOneBuffer()                // Problem with skip_recently_used and usage_count=1 (not flushed!)
    FlushBuffer()

FlushBuffer()
    XLogFlush()
    smgrwrite()

BufferAlloc()                  // Init with usage_count=1 is not logical => will never be flushed by the bgwriter!
    PinBuffer()

PinBuffer()
    usage_count++;
-----------------------------------
src/backend/storage/buffer/localbuf.c
-----------------------------------
LocalBufferAlloc()
    usage_count++;
-----------------------------------
src/backend/storage/smgr/md.c
-----------------------------------
smgrwrite() = mdwrite()
    => writes to the file (not flushed immediately), but registers it for later flushing via register_dirty_segment() at checkpoint time
smgrsync() = mdsync()
    => syncs the registered, not-yet-flushed files
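A minimal sketch of this write-now/fsync-at-checkpoint pattern (toy Python; the `pending` set is my own analogue of the dirty-segment registry, not the md.c code):

```python
import os
import tempfile

pending = set()   # stand-in for the md.c dirty-segment registry

def smgr_write(fd, offset, page):
    """Write a page into the OS page cache only; durability deferred."""
    os.pwrite(fd, page, offset)
    pending.add(fd)               # register_dirty_segment() analogue

def smgr_sync():
    """Checkpoint: fsync every file written since the last sync."""
    for fd in pending:
        os.fsync(fd)
    pending.clear()

fd, path = tempfile.mkstemp()
smgr_write(fd, 0, b"\x00" * 8192)   # one 8 kB page, PostgreSQL's default
smgr_sync()
size = os.fstat(fd).st_size
print(size)  # 8192
os.close(fd)
os.unlink(path)
```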
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general