On Fri, 2 Oct 2009, Greg Smith wrote:
On Fri, 2 Oct 2009, Scott Marlowe wrote:
I found that lowering checkpoint completion target was what helped.
Does that seem counter-intuitive to you?
I set it to 0.0 now.
Generally, but there are plenty of ways you can get into a state where a
short but not immediate checkpoint is better. For example, consider a case
where your buffer cache is filled with really random stuff. There's a
sorting horizon in effect, where your OS and/or controller makes decisions
about what order to write things based on the data it already has around, not
really knowing what's coming in the near future.
OK, if the checkpoint doesn't block anything during normal operation, then its duration doesn't really matter.
Let's say you've got 256MB of cache in the disk controller, you have 1GB of
buffer cache to write out, and there's 8GB of RAM in the server so it can
cache the whole write. If you wrote it out in a big burst, the OS would
elevator sort things and feed them to the controller in disk order. Very
efficient, one pass over the disk to write everything out.
But if you broke that up into 256MB write pieces instead on the database
side, pausing after each chunk was written, the OS would only be sorting
across 256MB at a time, and would basically fill the controller cache up with
that before it saw the larger picture. The disk controller can then end up making seek decisions within that small planning window that are not really optimal, making more passes over the disk to write the same data out. If the timing between the DB write cache and the OS is pathologically out of sync here, the result can end up slower than if you had just written bigger chunks in the first place. This is one reason I'd like to see fsync calls happen earlier and more evenly than they do now, to reduce these edge cases.
The usual approach I take in this situation is to reduce the amount of write
caching the OS does, so at least things get more predictable. A giant write
cache always gives the best average performance, but the worst-case behavior
increases at the same time.
There was a patch floating around at one point that sorted all the checkpoint
writes by block order, which would reduce how likely it is you'll end up in
one of these odd cases. That turned out to be hard to nail down the benefit of, though, because in a typical case the OS caching trumps any I/O scheduling you try to do in userland, and it's hard to repeatably generate scattered data in a benchmark situation.
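To convince myself of that sorting-horizon effect, I put together a toy simulation (plain Python, nothing PostgreSQL-specific; the block counts and the 256-block window are made-up numbers standing in for the cache sizes above):

```python
import random

def write_passes(blocks, sort_window):
    """Simulate an elevator: writes are sorted only within each chunk
    of `sort_window` blocks; every backward jump of the disk head
    starts another pass over the disk."""
    order = []
    for i in range(0, len(blocks), sort_window):
        order.extend(sorted(blocks[i:i + sort_window]))
    passes = 1
    for prev, cur in zip(order, order[1:]):
        if cur < prev:
            passes += 1
    return passes

random.seed(1)
dirty = random.sample(range(100_000), 4096)  # scattered dirty blocks

one_burst = write_passes(dirty, len(dirty))  # OS sorts the whole burst
chunked = write_passes(dirty, 256)           # sorting horizon: 256 blocks
print(one_burst, chunked)                    # 1 pass vs. many passes
```

The big burst collapses to a single sorted sweep, while the chunked writes force the head back to the start of the disk at nearly every chunk boundary.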
OK, with a basic insert test and a SystemTap script
(http://www.wiesinger.com/opensource/systemtap/postgresql-checkpoint.stp),
the checkpoint still shows up as a major I/O spike:
################################################################################
Buffers between : Sun Oct 4 18:29:50 2009, synced 55855 buffer(s), flushed 744 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:29:50 2009
Checkpoint end : Sun Oct 4 18:29:56 2009, synced 12031 buffer(s), flushed 12031 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:20 2009, synced 79000 buffer(s), flushed 0 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:20 2009
Checkpoint end : Sun Oct 4 18:30:26 2009, synced 10753 buffer(s), flushed 10753 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:50 2009, synced 51120 buffer(s), flushed 1007 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:50 2009
Checkpoint end : Sun Oct 4 18:30:56 2009, synced 11899 buffer(s), flushed 11912 buffer(s)
################################################################################
OK, I had a further look at the code to understand the behavior of the
buffer cache and the background writer, since the observed behavior didn't
seem logical to me.
As far as I can see, the basic algorithm is:
1.) Normally (outside checkpoints), only dirty and non-recently-used pages
(usage_count == 0) are flushed to disk. I think that's basically fine as a
strategy, since indexes might update blocks more than once. It's also OK that
blocks are written but not flushed (that will be done at checkpoint time).
2.) At checkpoints, write out all dirty buffers and flush everything
written, both previously and newly. Spreading the I/O also seems OK to me now.
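As a sanity check of my understanding, a minimal sketch of that two-part strategy (toy Python, not the actual C code; the aging decrement in the bgwriter pass is my assumption about how the clock sweep ought to behave):

```python
from dataclasses import dataclass

@dataclass
class Buf:
    dirty: bool
    usage_count: int

def bgwriter_pass(pool):
    """1.) Between checkpoints: write only dirty buffers that are not
    recently used (usage_count == 0); age the others instead of
    writing them (the aging step is my assumption)."""
    written = []
    for i, b in enumerate(pool):
        if b.usage_count > 0:
            b.usage_count -= 1   # skip now, but age it for a later pass
        elif b.dirty:
            b.dirty = False      # write() to the OS only, fsync deferred
            written.append(i)
    return written

def checkpoint_pass(pool):
    """2.) At checkpoint: write every dirty buffer, recently used or
    not; the fsync of everything written happens afterwards."""
    written = [i for i, b in enumerate(pool) if b.dirty]
    for b in pool:
        b.dirty = False
    return written

pool = [Buf(True, 0), Buf(True, 2), Buf(False, 0)]
print(bgwriter_pass(pool))    # [0] -- the hot dirty buffer is skipped
print(checkpoint_pass(pool))  # [1] -- the checkpoint writes it anyway
```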
BUT: I think I've found 2 major bugs in the implementation (or I didn't
understand something correctly). The codebase analyzed is 8.3.8, since
that's what I currently use.
##############################################
Bug1: usage_count is IMHO not consistent
##############################################
I think this has been introduced with:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blobdiff;f=src/backend/storage/buffer/bufmgr.c;h=6e6b862273afea40241e410e18fd5d740c2b1643;hp=97f7822077de683989a064cdc624a025f85e54ab;hb=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0;hpb=98ffa4e9bd75c8124378c712933bb13d2697b694
So either the usage_count = 1 initialization in BufferAlloc() is not correct,
or SyncOneBuffer() with skip_recently_used and usage_count = 1 is not correct:
if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
    result |= BUF_REUSABLE;
else if (skip_recently_used)
{
    /* Caller told us not to write recently-used buffers */
    UnlockBufHdr(bufHdr);
    return result;
}
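A toy model of what I mean (plain Python, just mirroring the conditional above with refcount assumed 0; not PostgreSQL code):

```python
def sync_one_buffer(usage_count, dirty, skip_recently_used):
    """Toy mirror of the excerpt above: recently used buffers are
    skipped when the caller asks for it."""
    if usage_count == 0:
        return "written" if dirty else "clean"
    if skip_recently_used:
        return "skipped"          # the bgwriter path gives up here
    return "written" if dirty else "clean"

# BufferAlloc() initializes usage_count = 1, so on the bgwriter path
# (skip_recently_used = True) a freshly loaded dirty buffer is skipped
# on every pass unless something else drops the count back to 0:
print(sync_one_buffer(1, dirty=True, skip_recently_used=True))   # skipped
# The checkpoint path (skip_recently_used = False) writes it anyway:
print(sync_one_buffer(1, dirty=True, skip_recently_used=False))  # written
```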
##############################################
Bug2: Double iteration of buffers
##############################################
As you can see in the call tree below, the buffers are iterated over twice.
This might be a major performance bottleneck.
// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=false
        CheckpointWriteDelay() // Bug here? BgBufferSync() is called, which iterates over the buffers again!

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }

BgBufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g. indexes). But an improvement (e.g. aging) is IMHO necessary
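A rough back-of-the-envelope of the extra scanning this nesting could imply in the worst case, assuming each delay call re-walks the whole pool (all numbers below are illustrative guesses, not measurements):

```python
# Shared-buffer scans caused by one checkpoint, if every
# CheckpointWriteDelay() call runs BgBufferSync() and that walks the
# whole pool of buffer headers again.
shared_buffers = 16_384        # pool size (128 MB at 8 kB pages), a guess
dirty_at_checkpoint = 12_000   # roughly matches the log output above
writes_per_delay = 100         # hypothetical writes between two delays

delays = dirty_at_checkpoint // writes_per_delay  # 120 delay calls
extra_visits = delays * shared_buffers            # one re-scan per delay
print(extra_visits)  # 1_966_080 extra buffer-header visits
```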
##############################################
BTW: Are there any tests available showing how fast a buffer cache hit is
versus a disk cache hit (not in the buffer cache, but in the OS/disk cache)?
I ask because a lot of locking is involved in the code.
BTW2: Oracle's buffer cache and background writer strategy is also
interesting:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/memory.htm#i10221
Thanks for any feedback.
Ciao,
Gerhard
--
http://www.wiesinger.com/
-----------------------------------
src/backend/postmaster/bgwriter.c
-----------------------------------
BackgroundWriterMain()
    loop forever:
        timeout:
            CreateCheckPoint() // NON_IMMEDIATE
            smgrcloseall()
        nontimeout:
            BgBufferSync()
        sleep
        // Rest is done in XLogWrite()

RequestCheckpoint()
    CreateCheckPoint() or signal through shared memory segment
    smgrcloseall()

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }
-----------------------------------
src/backend/commands/dbcommands.c
-----------------------------------
createdb()
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
dropdb()
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/commands/tablespace.c
-----------------------------------
DropTableSpace()
    RequestCheckpoint()
-----------------------------------
src/backend/tcop/utility.c
-----------------------------------
ProcessUtility()
    // Command CHECKPOINT;
    RequestCheckpoint() // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/access/transam/xlog.c
-----------------------------------
CreateCheckPoint()
    CheckPointGuts()
        CheckPointCLOG()
        CheckPointSUBTRANS()
        CheckPointMultiXact()
        CheckPointBuffers() /* performs all required fsyncs */
        CheckPointTwoPhase()

XLogWrite()
    too_much_transaction_log_consumed:
        RequestCheckpoint() // NON_IMMEDIATE

pg_start_backup()
    RequestCheckpoint() // CHECKPOINT_FORCE | CHECKPOINT_WAIT

XLogFlush()
    // Flush transaction log
-----------------------------------
src/backend/storage/buffer/bufmgr.c
-----------------------------------
CheckPointBuffers()
    BufferSync()
    smgrsync()

// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=false
        CheckpointWriteDelay() // Bug here? BgBufferSync() is called, which iterates over the buffers again!

// Background writer buffer sync
BgBufferSync()
    loop buffers:
        SyncOneBuffer()        // skip_recently_used=true, OK here since we don't want to flush recently used blocks (e.g. indexes). But an improvement (e.g. aging) is IMHO necessary

SyncOneBuffer()                // Problem with skip_recently_used and usage_count=1 (not flushed!)
    FlushBuffer()

FlushBuffer()
    XLogFlush()
    smgrwrite()

BufferAlloc()                  // Init with usage_count=1 is not logical => will never be flushed by the bgwriter!
    PinBuffer()

PinBuffer()
    usage_count++;
-----------------------------------
src/backend/storage/buffer/localbuf.c
-----------------------------------
LocalBufferAlloc()
    usage_count++;
-----------------------------------
src/backend/storage/smgr/md.c
-----------------------------------
smgrwrite() = mdwrite()
    => writes to the file (not flushed immediately), but registers it for later flushing via register_dirty_segment() at checkpoint time
smgrsync() = mdsync()
    => syncs the registered, not-yet-flushed files
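A minimal sketch of this write-now/fsync-at-checkpoint pattern (toy Python; the `pending` set is my own analogue of the dirty-segment registry, not the md.c code):

```python
import os
import tempfile

pending = set()   # stand-in for the md.c dirty-segment registry

def smgr_write(fd, offset, page):
    """Write a page into the OS page cache only; durability deferred."""
    os.pwrite(fd, page, offset)
    pending.add(fd)               # register_dirty_segment() analogue

def smgr_sync():
    """Checkpoint: fsync every file written since the last sync."""
    for fd in pending:
        os.fsync(fd)
    pending.clear()

fd, path = tempfile.mkstemp()
smgr_write(fd, 0, b"\x00" * 8192)   # one 8 kB page, PostgreSQL's default
smgr_sync()
size = os.fstat(fd).st_size
print(size)  # 8192
os.close(fd)
os.unlink(path)
```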
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general