On Nov 16, 2010, at 12:39 PM, Greg Smith wrote:
>
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                          88476.784/second
>
> Compare file sync methods using one write:
>         (unavailable: open_datasync)
>         open_sync 8k write                 1192.135/second
>         8k write, fdatasync                1222.158/second
>         8k write, fsync                    1097.980/second
>
> Compare file sync methods using two writes:
>         (unavailable: open_datasync)
>         2 open_sync 8k writes               527.361/second
>         8k write, 8k write, fdatasync      1105.204/second
>         8k write, 8k write, fsync          1084.050/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 966.047/second
>         2 open_sync 8k writes               529.565/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close             1064.177/second
>         8k write, close, fsync             1042.337/second
>
> Two notable things here.  One, there is no open_datasync defined in this
> older kernel.  Two, all methods of commit give equally inflated commit
> rates, far faster than the drive is capable of.  This proves this setup
> isn't flushing the drive's write cache after commit.

Nit: there is no open_sync, only open_dsync.  Prior to recent kernels, only
(semantically) open_dsync exists, labeled as open_sync.  New kernels move that
code to open_datasync and have a NEW open_sync that supposedly flushes
metadata properly.

>
> You can get safe behavior out of the old kernel by disabling its write
> cache:
>
> $ sudo /sbin/hdparm -W0 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 0 (off)
>  write-caching =  0 (off)
>
> Loops = 10000
>
> Simple write:
>         8k write                          89023.413/second
>
> Compare file sync methods using one write:
>         (unavailable: open_datasync)
>         open_sync 8k write                  106.968/second
>         8k write, fdatasync                 108.106/second
>         8k write, fsync                     104.238/second
>
> Compare file sync methods using two writes:
>         (unavailable: open_datasync)
>         2 open_sync 8k writes                51.637/second
>         8k write, 8k write, fdatasync       109.256/second
>         8k write, 8k write, fsync           103.952/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 109.562/second
>         2 open_sync 8k writes                52.752/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close              107.179/second
>         8k write, close, fsync              106.923/second
>
> And now results are as expected: just under 120/second.
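(Aside: just under 120/second is what a 7200 rpm drive should give -- one
cache flush per platter rotation.  For anyone who wants to reproduce the
measurement without the whole tool, the heart of what test_fsync times is
roughly the loop below.  This is a minimal sketch of my own, not the actual
PostgreSQL source; the 8k block, 10000 loops and /var/tmp path just mirror the
defaults shown above.)

/* Minimal sketch of the "8k write, fsync" timing loop, in the spirit of
 * test_fsync.  Rewrites the same 8k block and fsyncs it each iteration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK 8192
#define LOOPS 10000

int main(void)
{
    char            buf[BLOCK];
    struct timeval  start, stop;
    double          elapsed;
    int             i;
    int             fd = open("/var/tmp/test_fsync.out",
                              O_RDWR | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'a', BLOCK);

    gettimeofday(&start, NULL);
    for (i = 0; i < LOOPS; i++)
    {
        if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
        if (fsync(fd) != 0)                 { perror("fsync"); return 1; }
        if (lseek(fd, 0, SEEK_SET) < 0)     { perror("lseek"); return 1; }
    }
    gettimeofday(&stop, NULL);

    elapsed = (stop.tv_sec - start.tv_sec) +
              (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("8k write, fsync: %10.3f/second\n", LOOPS / elapsed);

    close(fd);
    return 0;
}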
>
> Onto RHEL6.  Setup for this initial test was:
>
> $ uname -a
> Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
> x86_64 x86_64 x86_64 GNU/Linux
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
> $ mount
> /dev/sda7 on / type ext4 (rw)
>
> And I started with the write cache off to see a straight comparison
> against the above:
>
> $ sudo hdparm -W0 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 0 (off)
>  write-caching =  0 (off)
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                         104194.886/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write               97.828/second
>         open_sync 8k write                  109.158/second
>         8k write, fdatasync                 109.838/second
>         8k write, fsync                      20.872/second

fsync is working now!  Flushing metadata properly reduces performance.
However, shouldn't open_sync slow down vs open_datasync too and be similar to
fsync?  Did you recompile your test on the RHEL6 system?

Code compiled on newer kernels will see O_DSYNC and O_SYNC as two separate
sentinel values, let's call them 1 and 2 respectively.  Code compiled against
earlier kernels will see both O_DSYNC and O_SYNC as the same value, 1.  So
code compiled against older kernels, asking for O_SYNC on a newer kernel,
will actually get O_DSYNC behavior!

This was intended.  I can't find the link to the mail, but it was Linus' idea
to let old code that expected the 'faster but incorrect' behavior retain it
on newer kernels.  Only a recompile with newer header files will trigger the
new behavior and expose the 'correct' open_sync behavior.

This will be 'fun' for postgres packagers and users -- data reliability
behavior differs based on what kernel it is compiled against.  Luckily, the
xlogs only need open_datasync semantics.
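A quick way to see which definitions a particular build picked up is
something like the sketch below (my own throwaway, not anything that ships
with postgres; what it prints depends entirely on the kernel/glibc headers
present at compile time):

/* Print the O_SYNC / O_DSYNC values this binary was compiled with.  On older
 * headers both names expand to the same number, so a binary built there keeps
 * the old O_DSYNC-style behavior even when it runs on a newer kernel. */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    printf("O_SYNC  = 0x%x\n", (unsigned int) O_SYNC);
#ifdef O_DSYNC
    printf("O_DSYNC = 0x%x\n", (unsigned int) O_DSYNC);
    if (O_SYNC == O_DSYNC)
        printf("same value: requesting O_SYNC gets O_DSYNC semantics\n");
    else
        printf("distinct values: the new metadata-flushing O_SYNC is in play\n");
#else
    printf("O_DSYNC is not defined by these headers\n");
#endif
    return 0;
}

Built against the old headers it reports one shared value, so an O_SYNC
request silently keeps O_DSYNC semantics; rebuilt on RHEL6 the two values
differ and the new open_sync path becomes reachable.  That is why where
test_fsync was compiled matters as much as where it runs.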
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.902/second
>         2 open_sync 8k writes                53.721/second
>         8k write, 8k write, fdatasync       109.731/second
>         8k write, 8k write, fsync            20.918/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 109.552/second
>         2 open_sync 8k writes                54.116/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close               20.800/second
>         8k write, close, fsync               20.868/second
>
> A few changes then.  open_datasync is available now.

Again, noting the detail that it is open_sync that is new (depending on where
it is compiled).  The old open_sync is relabeled to the new open_datasync.

> It looks slightly
> slower than the alternatives on this test, but I didn't see that on the
> later tests so I'm thinking that's just occasional run to run
> variation.  For some reason regular fsync is dramatically slower in this
> kernel than earlier ones.  Perhaps a lot more metadata being flushed all
> the way to the disk in that case now?
>
> The issue that I think Marti has been concerned about is highlighted in
> this interesting subset of the data:
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.902/second
>         8k write, 8k write, fdatasync       109.731/second
>
> The results here aren't surprising; if you do two dsync writes, that
> will take two disk rotations, while two writes followed by a single sync
> only takes one.  But that does mean that in the case of small values for
> wal_buffers, like the default, you could easily end up paying a rotation
> sync penalty more than once per commit.
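To spell out the two access patterns being compared there, they boil down to
something like the sketch below (again my own illustration, not test_fsync
itself; error handling is minimal and the path/block size are just
placeholders):

/* Sketch of the two commit patterns compared above.  With O_DSYNC each
 * write() waits for the media, so two writes cost two rotations; buffered
 * writes plus one fdatasync() push both blocks out in a single flush. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 8192

static char buf[BLOCK];

/* "2 open_datasync 8k writes": two synchronous waits per commit */
static void two_dsync_writes(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return; }
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* waits for disk */
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* waits again    */
    close(fd);
}

/* "8k write, 8k write, fdatasync": one synchronous wait per commit */
static void writes_then_fdatasync(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return; }
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* OS cache only */
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* OS cache only */
    if (fdatasync(fd) != 0) perror("fdatasync");          /* single flush  */
    close(fd);
}

int main(void)
{
    memset(buf, 'a', BLOCK);
    two_dsync_writes("/var/tmp/test_fsync.out");
    writes_then_fdatasync("/var/tmp/test_fsync.out");
    return 0;
}

With O_DSYNC every write() has to wait for the platter, so the per-commit cost
grows with the number of WAL writes; the buffered pattern defers everything to
one flush, which is why larger wal_buffers or the fdatasync method sidesteps
the extra rotations.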
>
> Next question is what happens if I turn the drive's write cache back on:
>
> $ sudo hdparm -W1 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 1 (on)
>  write-caching =  1 (on)
>
> $ ./test_fsync
>
> [gsmith@meddle fsync]$ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                         104198.143/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write              110.707/second
>         open_sync 8k write                  110.875/second
>         8k write, fdatasync                 110.794/second
>         8k write, fsync                      28.872/second
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            55.731/second
>         2 open_sync 8k writes                55.618/second
>         8k write, 8k write, fdatasync       110.551/second
>         8k write, 8k write, fsync            28.843/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 110.176/second
>         2 open_sync 8k writes                55.785/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close               28.779/second
>         8k write, close, fsync               28.855/second
>
> This is nice to see from a reliability perspective.  On all three of the
> viable sync methods here, the speed seen suggests the drive's volatile
> write cache is being flushed after every commit.  This is going to be
> bad for people who have gotten used to doing development on systems
> where that's not honored and they don't care, because this looks like a
> 90% drop in performance on those systems.
> But since the new behavior is
> safe and the earlier one was not, it's hard to get mad about it.

I would love to see the same tests in this detail for RHEL 5.5 (which has
ext3, ext4, and xfs).  I think this data reliability issue that requires
turning off the write cache was in the kernel ~2.6.26 to 2.6.31 range.
Ubuntu doesn't really care about this stuff, which is one reason I avoid it
for a prod db.  I know that xfs with the right settings on RHEL 5.5 does not
require disabling the write cache.

> Developers probably just need to be taught to turn synchronous_commit
> off to speed things up when playing with test data.

Absolutely.

> test_fsync writes to /var/tmp/test_fsync.out by default, not paying
> attention to what directory you're in.  So to use it to test another
> filesystem, you have to make sure to give it an explicit full path.
> Next I tested against the old Ubuntu partition that was formatted with
> ext3, with the write cache still on:
>
> # mount | grep /ext3
> /dev/sda5 on /ext3 type ext3 (rw)
> # ./test_fsync -f /ext3/test_fsync.out
> Loops = 10000
>
> Simple write:
>         8k write                         100943.825/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write              106.017/second
>         open_sync 8k write                  108.318/second
>         8k write, fdatasync                 108.115/second
>         8k write, fsync                     105.270/second
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.313/second
>         2 open_sync 8k writes                54.045/second
>         8k write, 8k write, fdatasync        55.291/second
>         8k write, 8k write, fsync            53.243/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                  54.980/second
>         2 open_sync 8k writes                53.563/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close              105.032/second
>         8k write, close, fsync              103.987/second
>
> Strange...it looks like ext3 is executing cache flushes, too.  Note that
> all of the "Compare file sync methods using two writes" results are half
> speed now; it's as if ext3 is flushing the first write out immediately?
> This result was unexpected, and I don't trust it yet; I want to validate
> this elsewhere.
>
> What about XFS?  That's a first class filesystem on RHEL6 too:

And available on later RHEL 5's.

> [root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
> Loops = 10000
>
> Simple write:
>         8k write                          71878.324/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write               36.303/second
>         open_sync 8k write                   35.714/second
>         8k write, fdatasync                  35.985/second
>         8k write, fsync                      35.446/second
>
> I stopped that there, sick of waiting for it, as there's obviously some
> serious work (mounting options or such at a minimum) that needs to be
> done before XFS matches the other two.  Will return to that later.

Yes, XFS requires some fiddling.  Its metadata operations are also very slow.

> So, what have we learned so far:
>
> 1) On these newer kernels, both ext4 and ext3 seem to be pushing data
> out through the drive write caches correctly.

I suspect that some older kernels are partially OK here too.  The kernel not
flushing properly appeared near 2.6.25-ish.

> 2) On single writes, there's no performance difference between the main
> three methods you might use, with the straight fsync method having a
> serious regression in this use case.

I'll ask again -- did you compile the test on RHEL6 for the RHEL6 tests?  The
open_sync behavior on later kernels depends on what kernel the test was
compiled against.

For fsync, it's not a regression; it's actually flushing metadata properly,
and therefore actually robust if there is a power failure during a write.
Even the write-cache-disabled case on the Ubuntu kernel could leave a
filesystem with corrupt data if the power failed in a metadata-intensive
write situation.

> 3) WAL writes that are forced by wal_buffers filling will turn into a
> commit-length write when using the new, default open_datasync.  Using
> the older default of fdatasync avoids that problem, in return for
> causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC
> writes over fdatasync ones is avoiding the OS cache.
>
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection.  So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.
>
> --
> Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance