(Resending this, the first one got bounced by mail.postgresql.org) On Wed, Oct 29, 2008 at 3:30 PM, David Rees <drees76@xxxxxxxxx> wrote: > On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote: >> What you should do first is confirm >> whether or not the slow commits line up with the end of the checkpoint, >> which is easy to see if you turn on log_checkpoints. That gives you timings >> for the write and fsync phases of the checkpoint which can also be >> informative. > > OK, log_checkpoints is turned on to see if any delays correspond to > checkpoint activity... Well, I'm pretty sure the delays are not checkpoint related. None of the slow commits line up at all with the end of checkpoints. The period of high delays occur during the same period of time each week, but it's not during a particularly high load period on the systems. It really seems like there must be something running in the background that is not showing up on the system activity logs, like a background RAID scrub or something. Here are a couple representative checkpoint complete messages from the logs: 2008-10-31 20:12:03 UTC: : : LOG: checkpoint complete: wrote 285 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=57.933 s, sync=0.053 s, total=57.990 s 2008-10-31 20:17:33 UTC: : : LOG: checkpoint complete: wrote 437 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=87.891 s, sync=0.528 s, total=88.444 s 2008-10-31 20:22:05 UTC: : : LOG: checkpoint complete: wrote 301 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=60.774 s, sync=0.033 s, total=60.827 s 2008-10-31 20:27:46 UTC: : : LOG: checkpoint complete: wrote 504 buffers (0.5%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=101.037 s, sync=0.049 s, total=101.122 s During this period of time there was probably 100 different queries writing/commiting data that took longer than a second (3-4 seconds seems to be the worst). The RAID controller on this machine is some sort of MegaRAID controller. I'll have to see if there is some sort of scheduled scan running during this period of time. One of the replicate nodes is an identical machine which I just noticed has the same slony commit slow downs logged even though it's only receiving data from slony from the primary node. There are two other nodes listening in on the same subscription, but these two nodes don't show the same slony commit slow downs, but these machines use a slightly different raid controller and are about 9 months newer than the other two. I'm still hoping that the checkpoint tuning has reduced commit latency during busy periods, I should have more data after the weekend. -Dave -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance