Re: Excessive (and slow) fsync() within single transaction

Stephen Tyler <stephen@xxxxxxxxxxxxxxxxx> · Thu, 10 Dec 2009 21:21:36 +1100

On Wed, Dec 9, 2009 at 12:57 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:

You should turn on log_checkpoint in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files.

I have turned on log_checkpoints, and re-run the command.  Checkpoints are being written every 220 to 360 seconds.  About 80% are "time" and 20% "xlog".  Here are a representative sample:

10/12/09 5:12:59 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:16:41 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 133795 buffers (25.5%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=112.281 s, sync=108.809 s, total=221.166 s

10/12/09 5:17:32 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: xlog
10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 292601 buffers (55.8%); 0 transaction log file(s) added, 0 removed, 76 recycled; write=162.919 s, sync=209.277 s, total=372.229 s

10/12/09 5:23:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:27:58 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 113330 buffers (21.6%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=165.252 s, sync=87.588 s, total=253.285 s

10/12/09 5:28:45 PM    org.postgresql.postgres[445]    LOG:  checkpoint starting: time
10/12/09 5:34:11 PM    org.postgresql.postgres[445]    LOG:  checkpoint complete: wrote 206947 buffers (39.5%); 0 transaction log file(s) added, 0 removed, 65 recycled; write=152.953 s, sync=172.978 s, total=326.460 s

So a "typical" checkpoint is around 200K buffers (1.5GBytes, 40% of max), taking around 150 seconds to write (10MByte/second, 1300 buffers/second), and around 150 seconds to sync.

The system pauses/freezes are much more frequent.  Roughly 20 to 60 seconds long, interspersed with a brief pulse of activity.  They align with the lengthy fsync calls reported by DTrace.  But there are many more fsync calls being made.  Most fsync calls return in a couple of milliseconds or microseconds.

As I have previously mentioned, my system is using the default "open_datasync" WAL sync method.  The test_fsync utility reports a small "open o_dsync, write" call returning in around 300 microseconds or less, both on the SSD RAID 0 array and a 7200rpm SATA HDD.  If that utility is to be believed, "open o_dsync, write" is not reaching the disk platters by the time it returns.

Why then does it take postgresql 200seconds to sync a checkpoint (or DTrace report 20 to 60 second fsync calls)?  The drives themselves have only smallish 32MB buffers.  I can write a 1.5GByte file (the size of the average checkpoint) in only 10 seconds, if I do it from outside postgresql.

Stephen