On Nov 16, 2010, at 12:39 PM, Greg Smith wrote:
>
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                          88476.784/second
>
> Compare file sync methods using one write:
>         (unavailable: open_datasync)
>         open_sync 8k write                 1192.135/second
>         8k write, fdatasync                1222.158/second
>         8k write, fsync                    1097.980/second
>
> Compare file sync methods using two writes:
>         (unavailable: open_datasync)
>         2 open_sync 8k writes               527.361/second
>         8k write, 8k write, fdatasync      1105.204/second
>         8k write, 8k write, fsync          1084.050/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 966.047/second
>         2 open_sync 8k writes               529.565/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close             1064.177/second
>         8k write, close, fsync             1042.337/second
>
> Two notable things here.  One, there is no open_datasync defined in this
> older kernel.  Two, all methods of commit give equally inflated commit
> rates, far faster than the drive is capable of.  This proves this setup
> isn't flushing the drive's write cache after commit.

Nit: there is no open_sync, only open_dsync.  Prior to recent kernels, only
(semantically) open_dsync exists, labeled as open_sync.  New kernels move that
code to open_datasync and have a NEW open_sync that supposedly flushes
metadata properly.

>
> You can get safe behavior out of the old kernel by disabling its write
> cache:
>
> $ sudo /sbin/hdparm -W0 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 0 (off)
>  write-caching =  0 (off)
>
> Loops = 10000
>
> Simple write:
>         8k write                          89023.413/second
>
> Compare file sync methods using one write:
>         (unavailable: open_datasync)
>         open_sync 8k write                  106.968/second
>         8k write, fdatasync                 108.106/second
>         8k write, fsync                     104.238/second
>
> Compare file sync methods using two writes:
>         (unavailable: open_datasync)
>         2 open_sync 8k writes                51.637/second
>         8k write, 8k write, fdatasync       109.256/second
>         8k write, 8k write, fsync           103.952/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 109.562/second
>         2 open_sync 8k writes                52.752/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close              107.179/second
>         8k write, close, fsync              106.923/second
>
> And now results are as expected: just under 120/second.
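(Aside: just under 120/second is what a 7200 rpm drive should give -- one
cache flush per platter rotation.  For anyone who wants to reproduce the
measurement without the whole tool, the heart of what test_fsync times is
roughly the loop below.  This is a minimal sketch of my own, not the actual
PostgreSQL source; the 8k block, 10000 loops and /var/tmp path just mirror the
defaults shown above.)

/* Minimal sketch of the "8k write, fsync" timing loop, in the spirit of
 * test_fsync.  Rewrites the same 8k block and fsyncs it each iteration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK 8192
#define LOOPS 10000

int main(void)
{
    char            buf[BLOCK];
    struct timeval  start, stop;
    double          elapsed;
    int             i;
    int             fd = open("/var/tmp/test_fsync.out",
                              O_RDWR | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'a', BLOCK);

    gettimeofday(&start, NULL);
    for (i = 0; i < LOOPS; i++)
    {
        if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
        if (fsync(fd) != 0)                 { perror("fsync"); return 1; }
        if (lseek(fd, 0, SEEK_SET) < 0)     { perror("lseek"); return 1; }
    }
    gettimeofday(&stop, NULL);

    elapsed = (stop.tv_sec - start.tv_sec) +
              (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("8k write, fsync: %10.3f/second\n", LOOPS / elapsed);

    close(fd);
    return 0;
}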
>
> Onto RHEL6.  Setup for this initial test was:
>
> $ uname -a
> Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
> x86_64 x86_64 x86_64 GNU/Linux
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
> $ mount
> /dev/sda7 on / type ext4 (rw)
>
> And I started with the write cache off to see a straight comparison
> against the above:
>
> $ sudo hdparm -W0 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 0 (off)
>  write-caching =  0 (off)
> $ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                         104194.886/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write               97.828/second
>         open_sync 8k write                  109.158/second
>         8k write, fdatasync                 109.838/second
>         8k write, fsync                      20.872/second

fsync is working now!  Flushing metadata properly reduces performance.
However, shouldn't open_sync slow down vs open_datasync too and be similar to
fsync?  Did you recompile your test on the RHEL6 system?

Code compiled on newer kernels will see O_DSYNC and O_SYNC as two separate
sentinel values, let's call them 1 and 2 respectively.  Code compiled against
earlier kernels will see both O_DSYNC and O_SYNC as the same value, 1.  So
code compiled against older kernels, asking for O_SYNC on a newer kernel,
will actually get O_DSYNC behavior!

This was intended.  I can't find the link to the mail, but it was Linus' idea
to let old code that expected the 'faster but incorrect' behavior retain it
on newer kernels.  Only a recompile with newer header files will trigger the
new behavior and expose the 'correct' open_sync behavior.

This will be 'fun' for postgres packagers and users -- data reliability
behavior differs based on what kernel it is compiled against.  Luckily, the
xlogs only need open_datasync semantics.
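A quick way to see which definitions a particular build picked up is
something like the sketch below (my own throwaway, not anything that ships
with postgres; what it prints depends entirely on the kernel/glibc headers
present at compile time):

/* Print the O_SYNC / O_DSYNC values this binary was compiled with.  On older
 * headers both names expand to the same number, so a binary built there keeps
 * the old O_DSYNC-style behavior even when it runs on a newer kernel. */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    printf("O_SYNC  = 0x%x\n", (unsigned int) O_SYNC);
#ifdef O_DSYNC
    printf("O_DSYNC = 0x%x\n", (unsigned int) O_DSYNC);
    if (O_SYNC == O_DSYNC)
        printf("same value: requesting O_SYNC gets O_DSYNC semantics\n");
    else
        printf("distinct values: the new metadata-flushing O_SYNC is in play\n");
#else
    printf("O_DSYNC is not defined by these headers\n");
#endif
    return 0;
}

Built against the old headers it reports one shared value, so an O_SYNC
request silently keeps O_DSYNC semantics; rebuilt on RHEL6 the two values
differ and the new open_sync path becomes reachable.  That is why where
test_fsync was compiled matters as much as where it runs.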
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.902/second
>         2 open_sync 8k writes                53.721/second
>         8k write, 8k write, fdatasync       109.731/second
>         8k write, 8k write, fsync            20.918/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 109.552/second
>         2 open_sync 8k writes                54.116/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close               20.800/second
>         8k write, close, fsync               20.868/second
>
> A few changes then.  open_datasync is available now.

Again, noting the detail that it is open_sync that is new (depending on where
it is compiled).  The old open_sync is relabeled to the new open_datasync.

> It looks slightly
> slower than the alternatives on this test, but I didn't see that on the
> later tests so I'm thinking that's just occasional run to run
> variation.  For some reason regular fsync is dramatically slower in this
> kernel than earlier ones.  Perhaps a lot more metadata being flushed all
> the way to the disk in that case now?
>
> The issue that I think Marti has been concerned about is highlighted in
> this interesting subset of the data:
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.902/second
>         8k write, 8k write, fdatasync       109.731/second
>
> The results here aren't surprising; if you do two dsync writes, that
> will take two disk rotations, while two writes followed by a single sync
> only takes one.  But that does mean that in the case of small values for
> wal_buffers, like the default, you could easily end up paying a rotation
> sync penalty more than once per commit.
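To spell out the two access patterns being compared there, they boil down to
something like the sketch below (again my own illustration, not test_fsync
itself; error handling is minimal and the path/block size are just
placeholders):

/* Sketch of the two commit patterns compared above.  With O_DSYNC each
 * write() waits for the media, so two writes cost two rotations; buffered
 * writes plus one fdatasync() push both blocks out in a single flush. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 8192

static char buf[BLOCK];

/* "2 open_datasync 8k writes": two synchronous waits per commit */
static void two_dsync_writes(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return; }
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* waits for disk */
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* waits again    */
    close(fd);
}

/* "8k write, 8k write, fdatasync": one synchronous wait per commit */
static void writes_then_fdatasync(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return; }
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* OS cache only */
    if (write(fd, buf, BLOCK) != BLOCK) perror("write");  /* OS cache only */
    if (fdatasync(fd) != 0) perror("fdatasync");          /* single flush  */
    close(fd);
}

int main(void)
{
    memset(buf, 'a', BLOCK);
    two_dsync_writes("/var/tmp/test_fsync.out");
    writes_then_fdatasync("/var/tmp/test_fsync.out");
    return 0;
}

With O_DSYNC every write() has to wait for the platter, so the per-commit cost
grows with the number of WAL writes; the buffered pattern defers everything to
one flush, which is why larger wal_buffers or the fdatasync method sidesteps
the extra rotations.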
>
> Next question is what happens if I turn the drive's write cache back on:
>
> $ sudo hdparm -W1 /dev/sda
>
> /dev/sda:
>  setting drive write-caching to 1 (on)
>  write-caching =  1 (on)
>
> $ ./test_fsync
>
> [gsmith@meddle fsync]$ ./test_fsync
> Loops = 10000
>
> Simple write:
>         8k write                         104198.143/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write              110.707/second
>         open_sync 8k write                  110.875/second
>         8k write, fdatasync                 110.794/second
>         8k write, fsync                      28.872/second
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            55.731/second
>         2 open_sync 8k writes                55.618/second
>         8k write, 8k write, fdatasync       110.551/second
>         8k write, 8k write, fsync            28.843/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                 110.176/second
>         2 open_sync 8k writes                55.785/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close               28.779/second
>         8k write, close, fsync               28.855/second
>
> This is nice to see from a reliability perspective.  On all three of the
> viable sync methods here, the speed seen suggests the drive's volatile
> write cache is being flushed after every commit.  This is going to be
> bad for people who have gotten used to doing development on systems
> where that's not honored and they don't care, because this looks like a
> 90% drop in performance on those systems.
> But since the new behavior is
> safe and the earlier one was not, it's hard to get mad about it.

I would love to see the same tests in this detail for RHEL 5.5 (which has
ext3, ext4, and xfs).  I think this data reliability issue that requires
turning off the write cache was in the kernel ~2.6.26 to 2.6.31 range.
Ubuntu doesn't really care about this stuff, which is one reason I avoid it
for a prod db.  I know that xfs with the right settings on RHEL 5.5 does not
require disabling the write cache.

> Developers probably just need to be taught to turn synchronous_commit
> off to speed things up when playing with test data.

Absolutely.

> test_fsync writes to /var/tmp/test_fsync.out by default, not paying
> attention to what directory you're in.  So to use it to test another
> filesystem, you have to make sure to give it an explicit full path.
> Next I tested against the old Ubuntu partition that was formatted with
> ext3, with the write cache still on:
>
> # mount | grep /ext3
> /dev/sda5 on /ext3 type ext3 (rw)
> # ./test_fsync -f /ext3/test_fsync.out
> Loops = 10000
>
> Simple write:
>         8k write                         100943.825/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write              106.017/second
>         open_sync 8k write                  108.318/second
>         8k write, fdatasync                 108.115/second
>         8k write, fsync                     105.270/second
>
> Compare file sync methods using two writes:
>         2 open_datasync 8k writes            53.313/second
>         2 open_sync 8k writes                54.045/second
>         8k write, 8k write, fdatasync        55.291/second
>         8k write, 8k write, fsync            53.243/second
>
> Compare open_sync with different sizes:
>         open_sync 16k write                  54.980/second
>         2 open_sync 8k writes                53.563/second
>
> Test if fsync on non-write file descriptor is honored:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
>         8k write, fsync, close              105.032/second
>         8k write, close, fsync              103.987/second
>
> Strange...it looks like ext3 is executing cache flushes, too.  Note that
> all of the "Compare file sync methods using two writes" results are half
> speed now; it's as if ext3 is flushing the first write out immediately?
> This result was unexpected, and I don't trust it yet; I want to validate
> this elsewhere.
>
> What about XFS?  That's a first class filesystem on RHEL6 too:

And available on later RHEL 5's.

> [root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
> Loops = 10000
>
> Simple write:
>         8k write                          71878.324/second
>
> Compare file sync methods using one write:
>         open_datasync 8k write               36.303/second
>         open_sync 8k write                   35.714/second
>         8k write, fdatasync                  35.985/second
>         8k write, fsync                      35.446/second
>
> I stopped that there, sick of waiting for it, as there's obviously some
> serious work (mounting options or such at a minimum) that needs to be
> done before XFS matches the other two.  Will return to that later.

Yes, XFS requires some fiddling.  Its metadata operations are also very slow.

> So, what have we learned so far:
>
> 1) On these newer kernels, both ext4 and ext3 seem to be pushing data
> out through the drive write caches correctly.

I suspect that some older kernels are partially OK here too.  The kernel not
flushing properly appeared near 2.6.25-ish.

> 2) On single writes, there's no performance difference between the main
> three methods you might use, with the straight fsync method having a
> serious regression in this use case.

I'll ask again -- did you compile the test on RHEL6 for the RHEL6 tests?  The
open_sync behavior on later kernels depends on what kernel the test was
compiled against.

For fsync, it's not a regression; it's actually flushing metadata properly,
and therefore actually robust if there is a power failure during a write.
Even the write-cache-disabled case on the Ubuntu kernel could leave a
filesystem with corrupt data if the power failed in a metadata-intensive
write situation.

> 3) WAL writes that are forced by wal_buffers filling will turn into a
> commit-length write when using the new, default open_datasync.  Using
> the older default of fdatasync avoids that problem, in return for
> causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC
> writes over fdatasync ones is avoiding the OS cache.
>
> I want to next go through and replicate some of the actual database
> level tests before giving a full opinion on whether this data proves
> it's worth changing the wal_sync_method detection.  So far I'm torn
> between whether that's the right approach, or if we should just increase
> the default value for wal_buffers to something more reasonable.
>
> --
> Greg Smith   2ndQuadrant US    greg@xxxxxxxxxxxxxxx   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance