checkpoints/bgwriter tuning verification

Vick Khera <vivek@xxxxxxxxx> · Thu, 29 Oct 2009 14:46:13 -0400

On my primary DB I'm observing random slowness which just doesn't make
sense to me.  The I/O system can easily do 40MB/sec writes, but I'm
only seeing a sustained 5MB/sec, even as the application is stalling
waiting on the DB.

My only guess is that I'm getting hit by checkpoints too often, and
this is causing some delays.  I did a lot of reading and came across
this most excelelent article by Greg Smith
(http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm).
Reading that and the Pg manual and looking at the statistics, here is
what I think I need to do.  (Greg, do you have a performance tuning
book?  If so, I really want to buy it!  Your articles are awesome.)

current settings for checkpoints:
checkpoint_segments = 32
checkpoint_timeout = 900
checkpoint_completion_target = 0.9
log_checkpoints = on

In my logs I see this pretty consistently during times of slowness:

Oct 29 14:17:38 d04 postgres[54902]: [10990-1] LOG:  checkpoint starting: xlog
Oct 29 14:22:04 d04 postgres[54902]: [10991-1] LOG:  checkpoint
complete: wrote 52828 buffers (24.3%); 0 transaction log file(s)
added, 0 removed, 32 recycled; write=265.881
Oct 29 14:22:57 d04 postgres[54902]: [10992-1] LOG:  checkpoint starting: xlog
Oct 29 14:26:56 d04 postgres[54902]: [10993-1] LOG:  checkpoint
complete: wrote 52773 buffers (24.3%); 0 transaction log file(s)
added, 0 removed, 32 recycled; write=234.846
Oct 29 14:28:32 d04 postgres[54902]: [10994-1] LOG:  checkpoint starting: xlog
Oct 29 14:33:32 d04 postgres[54902]: [10995-1] LOG:  checkpoint
complete: wrote 53807 buffers (24.7%); 0 transaction log file(s)
added, 0 removed, 32 recycled; write=299.170

basically, the next checkpoint starts within a few seconds of the
prior one completing.

The stats show this:

# select * from pg_stat_bgwriter;
-[ RECORD 1 ]------+-----------
checkpoints_timed  | 3236
checkpoints_req    | 83044
buffers_checkpoint | 1376460896
buffers_clean      | 59124159
maxwritten_clean   | 304410
buffers_backend    | 285595787
buffers_alloc      | 6643047623

Based on Greg's article and the above number showing that most
checkpoints are triggered by running out of WAL segments, I should
increase my checkpoint_buffers.  Also, based on the article, I should
increase the bgwriter_lru_maxpages (currently at default 100).

My plan is to bump checkpoint_segments to 48 and reduce
checkpoint_completion_target to 0.7, and bump the
bgwriter_lru_maxpages to 500.

Can the checkpoint operation actually cause the DB to stop responding
for a few seconds at a time?  That seems to be what I observe.
Sometimes for 5 or more seconds one transaction will just stall.

Thanks for any ideas.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general