Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

Time for a deeper look at what's going on here...I installed RHEL6 Beta 2 yesterday, on the presumption that since the release version just came out this week, it was likely the same version Marti tested against. Also, it was the one I already had a DVD to install from. This was on a laptop with a 7200 RPM hard drive, already containing an Ubuntu installation for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to get a gross idea of which situations the drives involved were likely flushing data to disk correctly in, and in which that couldn't possibly be true. 7200 RPM = 120 rotations/second, which puts an upper limit of 120 true fsync executions per second on this drive. The test_fsync released with PostgreSQL 9.0 now reports its results on a scale you can compare directly against that figure (earlier versions reported seconds/commit rather than commits/second).

First I built test_fsync from inside an existing PostgreSQL 9.1 HEAD checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started by looking at the Ubuntu system running ext3, which represents the status quo we've been seeing for the past few years. Initially the drive write cache was turned on:

Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
   8k write                      88476.784/second

Compare file sync methods using one write:
   (unavailable: open_datasync)
   open_sync 8k write             1192.135/second
   8k write, fdatasync            1222.158/second
   8k write, fsync                1097.980/second

Compare file sync methods using two writes:
   (unavailable: open_datasync)
   2 open_sync 8k writes           527.361/second
   8k write, 8k write, fdatasync  1105.204/second
   8k write, 8k write, fsync      1084.050/second

Compare open_sync with different sizes:
   open_sync 16k write             966.047/second
   2 open_sync 8k writes           529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
   8k write, fsync, close         1064.177/second
   8k write, close, fsync         1042.337/second

Two notable things here. One, there is no open_datasync defined in this older kernel. Two, all methods of commit give equally inflated commit rates, far faster than the drive is capable of. This proves this setup isn't flushing the drive's write cache after commit.

You can get safe behavior out of the old kernel by disabling its write cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching =  0 (off)

Loops = 10000

Simple write:
   8k write                      89023.413/second

Compare file sync methods using one write:
   (unavailable: open_datasync)
   open_sync 8k write              106.968/second
   8k write, fdatasync             108.106/second
   8k write, fsync                 104.238/second

Compare file sync methods using two writes:
   (unavailable: open_datasync)
   2 open_sync 8k writes            51.637/second
   8k write, 8k write, fdatasync   109.256/second
   8k write, 8k write, fsync       103.952/second

Compare open_sync with different sizes:
   open_sync 16k write             109.562/second
   2 open_sync 8k writes            52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
   8k write, fsync, close          107.179/second
   8k write, close, fsync          106.923/second

And now results are as expected:  just under 120/second.

Onto RHEL6.  Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to see a straight comparison against the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching =  0 (off)
$ ./test_fsync
Loops = 10000

Simple write:
   8k write                      104194.886/second

Compare file sync methods using one write:
   open_datasync 8k write           97.828/second
   open_sync 8k write              109.158/second
   8k write, fdatasync             109.838/second
   8k write, fsync                  20.872/second

Compare file sync methods using two writes:
   2 open_datasync 8k writes        53.902/second
   2 open_sync 8k writes            53.721/second
   8k write, 8k write, fdatasync   109.731/second
   8k write, 8k write, fsync        20.918/second

Compare open_sync with different sizes:
   open_sync 16k write             109.552/second
   2 open_sync 8k writes            54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
   8k write, fsync, close           20.800/second
   8k write, close, fsync           20.868/second

A few changes then. open_datasync is available now. It looks slightly slower than the alternatives on this test, but I didn't see that on the later tests, so I'm thinking it's just occasional run-to-run variation. For some reason regular fsync is dramatically slower in this kernel than in earlier ones. Perhaps a lot more metadata is being flushed all the way to the disk in that case now?

The issue that I think Marti has been concerned about is highlighted in this interesting subset of the data:

Compare file sync methods using two writes:
   2 open_datasync 8k writes        53.902/second
   8k write, 8k write, fdatasync   109.731/second

The results here aren't surprising; if you do two dsync writes, that will take two disk rotations, while two writes followed by a single sync take only one. But it does mean that with small values for wal_buffers, like the default, you could easily end up paying that rotation sync penalty more than once per commit.
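If you want to reduce how often that happens in the meantime, the obvious knob is wal_buffers itself. A minimal sketch, with a value that's purely illustrative rather than derived from these tests:

   # postgresql.conf -- requires a server restart to take effect
   wal_buffers = 1MB        # the default in current releases is only 64kB

That makes it less likely that a single commit's worth of WAL spills out in multiple forced writes before the commit itself.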

Next question is what happens if I turn the drive's write cache back on:

$ sudo hdparm -W1 /dev/sda

/dev/sda:
setting drive write-caching to 1 (on)
write-caching =  1 (on)

$ ./test_fsync
Loops = 10000

Simple write:
   8k write                      104198.143/second

Compare file sync methods using one write:
   open_datasync 8k write          110.707/second
   open_sync 8k write              110.875/second
   8k write, fdatasync             110.794/second
   8k write, fsync                  28.872/second

Compare file sync methods using two writes:
   2 open_datasync 8k writes        55.731/second
   2 open_sync 8k writes            55.618/second
   8k write, 8k write, fdatasync   110.551/second
   8k write, 8k write, fsync        28.843/second

Compare open_sync with different sizes:
   open_sync 16k write             110.176/second
   2 open_sync 8k writes            55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
   8k write, fsync, close           28.779/second
   8k write, close, fsync           28.855/second

This is nice to see from a reliability perspective. On all three of the viable sync methods here, the speed seen suggests that the drive's volatile write cache is being flushed after every commit. This is going to be bad for people who have gotten used to doing development on systems where that isn't honored and who don't care, because this looks like a 90% drop in performance on those systems. But since the new behavior is safe and the earlier one was not, it's hard to get mad about it. Developers probably just need to be taught to turn synchronous_commit off to speed things up when playing with test data.
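For anyone who does want that on a development box, the setting involved is standard and can be changed per-session or globally; a minimal sketch:

   -- in a session, for throwaway test data only
   SET synchronous_commit = off;

   # or in postgresql.conf on a dev system, followed by a reload
   synchronous_commit = off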

test_fsync writes to /var/tmp/test_fsync.out by default, not paying attention to what directory you're in. So to use it to test another filesystem, you have to make sure to give it an explicit full path. Next I tested against the old Ubuntu partition that was formatted with ext3, with the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
   8k write                      100943.825/second

Compare file sync methods using one write:
   open_datasync 8k write          106.017/second
   open_sync 8k write              108.318/second
   8k write, fdatasync             108.115/second
   8k write, fsync                 105.270/second

Compare file sync methods using two writes:
   2 open_datasync 8k writes        53.313/second
   2 open_sync 8k writes            54.045/second
   8k write, 8k write, fdatasync    55.291/second
   8k write, 8k write, fsync        53.243/second

Compare open_sync with different sizes:
   open_sync 16k write              54.980/second
   2 open_sync 8k writes            53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
   8k write, fsync, close          105.032/second
   8k write, close, fsync          103.987/second

Strange...it looks like ext3 is executing cache flushes, too. Note that all of the "Compare file sync methods using two writes" results are half speed now; it's as if ext3 is flushing the first write out immediately? This result was unexpected, and I don't trust it yet; I want to validate this elsewhere.
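The sort of validation I have in mind (just a sketch of the approach, not something these numbers include) is watching the block layer while test_fsync runs, to see whether actual cache flush/barrier requests are reaching the drive:

$ sudo blktrace -d /dev/sda -o - | blkparse -i -
(look for barrier/flush flags in the RWBS column while the test is running)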

What about XFS?  That's a first-class filesystem on RHEL6 too:

[root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
   8k write                      71878.324/second

Compare file sync methods using one write:
   open_datasync 8k write           36.303/second
   open_sync 8k write               35.714/second
   8k write, fdatasync              35.985/second
   8k write, fsync                  35.446/second

I stopped that there, sick of waiting for it, as there's obviously some serious work (mount options or such, at a minimum) that needs to be done before XFS matches the other two. Will return to that later.
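If anyone wants to poke at that before I get back to it, the place I'd start (an untested guess; the device name and values here are hypothetical) is the XFS log buffering mount options:

$ sudo umount /xfs
$ sudo mount -o logbufs=8,logbsize=256k /dev/sda8 /xfs
$ ./test_fsync -f /xfs/test_fsync.out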

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data out through the drive write caches correctly.

2) On single writes, there's no performance difference among the three viable sync methods (open_datasync, open_sync, and fdatasync), while the straight fsync method has a serious regression in this use case.

3) WAL writes that are forced by wal_buffers filling will each turn into a synchronous write, as expensive as a commit, when using the new default of open_datasync. Using the older default of fdatasync avoids that problem, in return for causing WAL writes to pollute the OS cache. The main benefit of O_DSYNC writes over fdatasync ones is avoiding that OS cache pollution.

I want to next go through and replicate some of the actual database-level tests before giving a full opinion on whether this data proves it's worth changing the wal_sync_method detection. So far I'm torn between whether that's the right approach, or whether we should just increase the default value of wal_buffers to something more reasonable.
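For anyone who wants to experiment with the first approach in the meantime, overriding the detected default is a one-line postgresql.conf change (the value shown is just the older Linux default, not a recommendation from these tests):

   wal_sync_method = fdatasync   # instead of the open_datasync this platform now detects

The wal_buffers alternative is the sketch shown earlier.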

--
Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

