Re: Occasional Slow Commit

"David Rees" <drees76@xxxxxxxxx> · Fri, 31 Oct 2008 17:14:28 -0700

(Resending this; the first copy was bounced by mail.postgresql.org.)

On Wed, Oct 29, 2008 at 3:30 PM, David Rees <drees76@xxxxxxxxx> wrote:
> On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith <gsmith@xxxxxxxxxxxxx> wrote:
>> What you should do first is confirm
>> whether or not the slow commits line up with the end of the checkpoint,
>> which is easy to see if you turn on log_checkpoints.  That gives you timings
>> for the write and fsync phases of the checkpoint which can also be
>> informative.
>
> OK, log_checkpoints is turned on to see if any delays correspond to
> checkpoint activity...

Well, I'm pretty sure the delays are not checkpoint related. None of
the slow commits line up at all with the end of checkpoints.
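
One way to make the "do they line up" check mechanical is sketched
below. It is only an illustration: the function name, the 5-second
window, and the slow-commit timestamps are assumptions made up for the
example, not values from the logs. The two checkpoint end times are
taken from the checkpoint messages quoted later in this mail.

```python
from datetime import datetime, timedelta


def commits_near_checkpoint_end(commit_times, checkpoint_ends,
                                window=timedelta(seconds=5)):
    """Return the slow commits that fall within `window` of any checkpoint end.

    `window` is an arbitrary tolerance chosen for illustration.
    """
    return [
        c for c in commit_times
        if any(abs(c - end) <= window for end in checkpoint_ends)
    ]


# Checkpoint end times as logged in the "checkpoint complete" messages.
checkpoint_ends = [
    datetime(2008, 10, 31, 20, 12, 3),
    datetime(2008, 10, 31, 20, 17, 33),
]

# HYPOTHETICAL slow-commit times, invented purely to exercise the function.
slow_commits = [
    datetime(2008, 10, 31, 20, 14, 40),
    datetime(2008, 10, 31, 20, 17, 35),
]

suspects = commits_near_checkpoint_end(slow_commits, checkpoint_ends)
print(f"{len(suspects)} of {len(slow_commits)} slow commits are near a checkpoint end")
```

An empty result over a full week of slow commits would support the
conclusion that checkpoints are not the cause.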

The period of high delays occurs at the same time each week, but not
during a particularly high-load period on the systems.

It really seems like there must be something running in the background
that is not showing up in the system activity logs, like a background
RAID scrub or something.

Here are a couple of representative checkpoint complete messages from the logs:

2008-10-31 20:12:03 UTC: : : LOG:  checkpoint complete: wrote 285 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=57.933 s, sync=0.053 s, total=57.990 s
2008-10-31 20:17:33 UTC: : : LOG:  checkpoint complete: wrote 437 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=87.891 s, sync=0.528 s, total=88.444 s
2008-10-31 20:22:05 UTC: : : LOG:  checkpoint complete: wrote 301 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=60.774 s, sync=0.033 s, total=60.827 s
2008-10-31 20:27:46 UTC: : : LOG:  checkpoint complete: wrote 504 buffers (0.5%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=101.037 s, sync=0.049 s, total=101.122 s
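
To pull the end time and the write/sync/total timings out of messages
like these, something along the following lines should work. It is a
minimal sketch that assumes the exact message layout and timestamp
prefix shown above; a different log_line_prefix would need a different
pattern. The function name and dictionary keys are made up for the
example.

```python
import re
from datetime import datetime

# Assumes the prefix and "checkpoint complete" layout seen in the sample logs.
# \D*? keeps the timestamp tied to its own message: it refuses to skip over
# the digits of a later timestamp.
CHECKPOINT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\D*?"
    r"checkpoint complete: wrote (?P<buffers>\d+) buffers.*?"
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, "
    r"total=(?P<total>[\d.]+) s"
)


def parse_checkpoints(log_text):
    """Return one dict per 'checkpoint complete' message found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        {
            "end": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
            "buffers": int(m["buffers"]),
            "write_s": float(m["write"]),
            "sync_s": float(m["sync"]),
            "total_s": float(m["total"]),
        }
        for m in CHECKPOINT_RE.finditer(flat)
    ]


# First two messages from the logs, wrapped the way a mail client wraps them.
SAMPLE = """\
2008-10-31 20:12:03 UTC: : : LOG:  checkpoint complete: wrote 285
buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=57.933 s, sync=0.053 s, total=57.990 s
2008-10-31 20:17:33 UTC: : : LOG:  checkpoint complete: wrote 437
buffers (0.4%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=87.891 s, sync=0.528 s, total=88.444 s
"""

for cp in parse_checkpoints(SAMPLE):
    print(cp["end"], cp["buffers"], cp["write_s"], cp["sync_s"], cp["total_s"])
```

The sync times are what matter for spotting a checkpoint that stalls
on fsync; in the messages above they all stay well under a second.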

During this period of time there were probably 100 different queries
writing/committing data that took longer than a second (3-4 seconds
seems to be the worst).

The RAID controller on this machine is some sort of MegaRAID
controller. I'll have to see if there is some sort of scheduled scan
running during this period of time.

One of the replica nodes is an identical machine, and I just noticed
that it logs the same Slony commit slowdowns even though it only
receives data via Slony from the primary node. Two other nodes are
listening on the same subscription and don't show these slowdowns,
but those machines use a slightly different RAID controller and are
about 9 months newer than the other two.

I'm still hoping that the checkpoint tuning has reduced commit latency
during busy periods; I should have more data after the weekend.

-Dave

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance