Re: Occasional Slow Commit

Greg Smith <gsmith@xxxxxxxxxxxxx> · Wed, 29 Oct 2008 09:26:32 -0400 (EDT)

On Mon, 27 Oct 2008, David Rees wrote:

Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database
in question is replicated using slony)
Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller

The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that 
much RAM, and the whole thing can get slammed onto disk during the final 
fsync portion of the checkpoint.  What you should do first is confirm 
whether or not the slow commits line up with the end of the checkpoint, 
which is easy to see if you turn on log_checkpoints.  That gives you 
timings for the write and fsync phases of the checkpoint which can also be 
informative.

Reading this page[2] indicates that I may want to increase my
checkpoint_segments, checkpoint_timeout and bgwriter settings, but
looking at pg_stat_bgwriter seems to indicate that my current settings
are probably OK?

# select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint |
             3834 |             105 |            3,091,905 |
buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
25876 |              110 |         2,247,576 |       2,889,873

I reformatted the above to show what's happening a bit better.  Most of 
your checkpoints are the timed ones, which are unlikely to cause 
interference from a slow commit period (the writes are spread out over 5 
minutes in those cases).  It's quite possible the slow periods are coming 
only from the occasional requested checkpoints, which normally show up 
because checkpoint_segments is too low and you chew through segments too 
fast.  If you problems line up with checkpoint time, you would likely gain 
some benefit from increasing checkpoint_segments to spread out the 
checkpoint writes further; the 10 you're using now is on the low side for 
your hardware.

If the problems don't go away after that, you may be suffering from 
excessive Linux kernel buffering instead.  I've got a blog entry showing 
how I tracked down a similar long pause on a Linux server at 
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you 
may find helpful for determining if your issue is this one (which is 
pretty common on RHEL systems having relatively large amounts of RAM) or 
if it's something else, like the locking you mentioned.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance