On Mon, 27 Oct 2008, David Rees wrote:
Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database in question is replicated using slony) Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller
The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that much RAM, and the whole thing can get slammed onto disk during the final fsync portion of the checkpoint. What you should do first is confirm whether or not the slow commits line up with the end of the checkpoint, which is easy to see if you turn on log_checkpoints. That gives you timings for the write and fsync phases of the checkpoint which can also be informative.
Reading this page[2] indicates that I may want to increase my checkpoint_segments, checkpoint_timeout and bgwriter settings, but looking at pg_stat_bgwriter seems to indicate that my current settings are probably OK? # select * from pg_stat_bgwriter; checkpoints_timed | checkpoints_req | buffers_checkpoint | 3834 | 105 | 3,091,905 | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc 25876 | 110 | 2,247,576 | 2,889,873
I reformatted the above to show what's happening a bit better. Most of your checkpoints are the timed ones, which are unlikely to cause interference from a slow commit period (the writes are spread out over 5 minutes in those cases). It's quite possible the slow periods are coming only from the occasional requested checkpoints, which normally show up because checkpoint_segments is too low and you chew through segments too fast. If you problems line up with checkpoint time, you would likely gain some benefit from increasing checkpoint_segments to spread out the checkpoint writes further; the 10 you're using now is on the low side for your hardware.
If the problems don't go away after that, you may be suffering from excessive Linux kernel buffering instead. I've got a blog entry showing how I tracked down a similar long pause on a Linux server at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you may find helpful for determining if your issue is this one (which is pretty common on RHEL systems having relatively large amounts of RAM) or if it's something else, like the locking you mentioned.
-- * Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance