On 4/2/09 1:53 AM, "Greg Smith" <gsmith@xxxxxxxxxxxxx> wrote:

> On Wed, 1 Apr 2009, Scott Carey wrote:
>
>> Write caching on SATA is totally fine. There were some old ATA drives that
>> when paired with some file systems or OS's would not be safe. There are
>> some combinations that have unsafe write barriers. But there is a standard,
>> well supported ATA command to sync and only return after the data is on
>> disk. If you are running an OS that is anything recent at all, and any
>> disks that are not really old, you're fine.
>
> While I would like to believe this, I don't trust any claims in this area
> that don't have matching tests that demonstrate things working as
> expected. And I've never seen this work.
>
> My laptop has a 7200 RPM drive, which means that if fsync is being passed
> through to the disk correctly I can only fsync <120 times/second. Here's
> what I get when I run sysbench on it, starting with the default ext3
> configuration:
>
> $ uname -a
> Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC
> 2009 i686 GNU/Linux
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)
>
> $ sudo hdparm -I /dev/sda | grep FLUSH
>    * Mandatory FLUSH_CACHE
>    * FLUSH_CACHE_EXT
>
> $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1
> --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
> sysbench v0.4.8: multi-threaded system evaluation benchmark
>
> Running the test with following options:
> Number of threads: 1
>
> Extra file open flags: 0
> 1 files, 16Kb each
> 16Kb total file size
> Block size 16Kb
> Number of random requests for random IO: 10000
> Read/Write ratio for combined random IO test: 1.50
> Periodic FSYNC enabled, calling fsync() each 1 requests.
> Calling fsync() at the end of test, Enabled.
> Using synchronous I/O mode
> Doing random write test
> Threads started!
> Done.
>
> Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total
> Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec)
> 2507.29 Requests/sec executed
>
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>
> [Edit /etc/fstab to remount with barriers]
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)
>
> [sysbench again]
>
> 2612.74 Requests/sec executed
>
> -----
>
> This is basically how this always works for me: somebody claims barriers
> and/or SATA disks work now, no really this time. I test, they give
> answers that aren't possible if fsync were working properly, I conclude
> turning off the write cache is just as necessary as it always was. If you
> can suggest something wrong with how I'm testing here, I'd love to hear
> about it. I'd like to believe you but I can't seem to produce any
> evidence that supports your claims here.

Your data looks good, and puts a lot of doubt on my previous sources of
info. So I did more research, and it seems that (most) drives don't lie;
your OS and file system do (or sometimes the device drivers or RAID card).
I know LVM, MD, and the other Linux block remapping layers break write
barriers as well. Apparently ext3 doesn't implement fsync with a write
barrier or cache flush. Linux kernel mailing list threads implied that 2.6
had fixed these, but apparently not: write barriers were fixed, but not
fsync.
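
(By the way, your sysbench run can be reproduced with a few lines of C. The
sketch below is my own throwaway code, not anything from sysbench or
Postgres; the file name, block size, and iteration count are arbitrary
choices. The point is the same as your test: if fsync() really waited for
the platter, a 7200 RPM disk couldn't complete much more than ~120 of
these loops per second, so a result in the thousands means the flush never
reached the disk.)

#define _XOPEN_SOURCE 500       /* for pwrite() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Rewrite the same 16KB block and fsync() after every write, then report
 * the fsync rate. Build with: gcc -O2 fsync_rate.c -o fsync_rate */
int main(void)
{
    const int iterations = 1000;
    char buf[16384];
    memset(buf, 'x', sizeof(buf));

    int fd = open("fsync_test_file", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < iterations; i++) {
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) {
            perror("pwrite"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d write+fsync pairs in %.2f s = %.0f fsyncs/sec\n",
           iterations, secs, iterations / secs);

    close(fd);
    unlink("fsync_test_file");
    return 0;
}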
Even more confusing, heavily patched and backported kernels (SUSE and Red
Hat, mostly) may behave differently from distributions that stay closer to
the kernel trunk, like Ubuntu.

If you can, try xfs with write barriers on. I'll try some tests using FIO
(I'm not familiar with sysbench, but it looks easy too) with various file
systems and some SATA and SAS/SCSI setups when I get a chance.

A lot of my prior evidence came from the Linux kernel list and other places
whose info I've trusted over the years; I'll dig up more. But here is what
I've learned in the past, plus a bit from today: drives don't lie anymore,
and the write barrier and lower-level ATA flush commands just work. Linux
fixed write barrier support in kernel 2.5. Several OS's do the right thing
with respect to fsync and many don't. I had thought Linux had fixed this,
but it turns out they only fixed write barriers and left fsync broken:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024/thread

In your tests, turning barriers on barely changed the numbers, so fsync
clearly isn't issuing them. From what I can see, with ext3 it is metadata
changes that trigger write barrier activity, so 'relatime' and 'noatime'
actually HURT your data integrity as a side effect of fsync not
guaranteeing what you think it does.

The big one is this quote from the Linux kernel list:

"Right now, if you want a reliable database on Linux, you _cannot_ properly
depend on fsync() or fdatasync(). Considering how much Linux is used for
critical databases, using these functions, this amazes me."

Check out the full post that started that thread:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024

I admit it looks like I was pretty wrong, at least for Linux with ext3.
Linux is often not safe with disk write caches enabled because its fsync()
call doesn't flush the cache. The root problem is not the drives, it's
Linux / ext3. Its write-barrier support is fine now (if you don't go
through LVM or MD, which don't support it), but fsync does not guarantee
anything other than the write having left the OS and gone to the device.
In fact, POSIX fsync(2) doesn't require that the data is on disk.
Interestingly, Postgres would be safer on Linux if it used
sync_file_range() instead of fsync(), but that has other drawbacks and
limitations -- and is broken by use of LVM or MD. Currently, with Linux +
ext3 + Postgres, when fsync() returns you are only guaranteed that the data
has left the OS, not that it is on a drive -- SATA or SAS. Strangely,
sync_file_range() is safer than fsync() in the presence of any drive cache
at all (including a battery-backed RAID card whose battery fails) because
it at least enforces write barriers.

Fsync + SATA write cache is safe on Solaris with ZFS, but not on Solaris
with UFS (the former file system is write-barrier and cache aware, the
latter is not). Linux (a lot) and Postgres (a little) could learn from some
of the ZFS concepts with regard to atomicity of changes and checksums on
data and metadata. Many of the above issues would simply not exist with
good use of checksums. Ext4 has journal segment checksums, but no metadata
or data checksums, so partial writes can only be detected in the journal.
Postgres is adding checksums on data, and is already essentially
copy-on-write for MVCC, which is awesome -- are xlog writes protected by
checksums?
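
To make the checksum idea concrete, here is a toy sketch (illustration
only; this is nothing like the real Postgres xlog record format) of how a
per-record CRC lets a recovery pass detect a record that was torn,
reordered, or only partially flushed when the power went out:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-at-a-time CRC-32 (zlib polynomial); slow but dependency-free. */
static uint32_t crc32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Hypothetical log record: payload plus a CRC computed over the payload. */
struct log_record {
    uint32_t length;              /* bytes of payload actually used */
    uint32_t crc;                 /* CRC over payload[0..length-1]  */
    char     payload[64];
};

/* Recovery-time check: replay stops at the first record that fails this. */
static int record_is_valid(const struct log_record *rec)
{
    return rec->length <= sizeof(rec->payload)
        && rec->crc == crc32_buf(rec->payload, rec->length);
}

int main(void)
{
    struct log_record rec;
    const char *msg = "example xlog payload";

    memset(&rec, 0, sizeof(rec));
    rec.length = (uint32_t) strlen(msg);
    memcpy(rec.payload, msg, rec.length);
    rec.crc = crc32_buf(rec.payload, rec.length);
    printf("intact record valid?  %d\n", record_is_valid(&rec));

    rec.payload[3] ^= 0xFF;       /* simulate a torn / partial write */
    printf("damaged record valid? %d\n", record_is_valid(&rec));
    return 0;
}

(A real log format would presumably also cover the header fields with the
CRC and chain records together; the toy above only covers the payload.)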
Accidental out-of-order writes become an issue that a checksummed log or
journal can deal with, even on an OS and file system without good fsync
guarantees like Linux + ext3. Postgres could make itself safe even if the
drive write cache is enabled, fsync lies, AND there is a power failure. If
I'm not mistaken, block checksums on data plus xlog entry checksums make
corruption very difficult even if fsync is off (though data writes landing
before their xlog writes are still bad -- fixing that would require
checksums stored outside the block, like ZFS does)!

http://lkml.org/lkml/2005/5/15/85

Where the "disks lie to you" stuff probably came from:
http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252&tid=198&tid=128
(it turns out it's the OS that isn't flushing the cache on fsync)

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

So if xfs fsync issues a barrier, it's safe with either:
 * a raw device that respects cache flushes, with write caching on, OR
 * a battery-backed RAID card, with drive write caching off.

XFS fsync supposedly works right (I need to test), but fdatasync() does
not.

What this really boils down to is that POSIX fsync does not guarantee that
the data is on disk at all; my previous comments were wrong. It means that
fsync protects you from OS crashes, but not power failure. It can do better
on some systems / implementations.

>
> --
> * Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD
>

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance