On Thu, 2 Apr 2009, Scott Carey wrote:
> The big one is this quote from the linux kernel list: " Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me. "
Things aren't as bad as that out-of-context quote makes them seem. There are two main problem situations here:
1) You cannot trust Linux to flush data out of a hard drive's write cache when you fsync. Solution: turn off the write cache. Given the generally poor state of targeted fsync on Linux (quoting from a downthread comment by David Lang: "in data=ordered mode, the default for most distros, ext3 can end up having to write all pending data when you do a fsync on one file"), those fsyncs were likely to blow out the drive cache anyway.
2) There are no hard guarantees about write ordering at the disk level; if you write blocks ABC and then fsync, you might actually get, say, only B written before power goes out. I don't believe the PostgreSQL WAL design will be corrupted by this particular situation, because until that fsync comes back saying all 3 are done, none of them are relied upon.
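The "nothing is relied upon until fsync returns" rule can be sketched in a few lines of Python. This is only an illustration of the pattern, not PostgreSQL's WAL code; the function name durable_append and the file layout are made up for the example.

```python
import os


def durable_append(path, blocks):
    """Append blocks to path and return the byte count once fsync succeeds.

    Hypothetical helper for illustration. The caller must treat none of
    the blocks as durable unless this function returns normally.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for block in blocks:
            view = memoryview(block)
            # os.write may write fewer bytes than requested, so loop.
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
        # The blocks may reach the disk in any order before this returns.
        # An OSError here means none of them can be assumed written.
        os.fsync(fd)
    finally:
        os.close(fd)
    return sum(len(b) for b in blocks)
```

A caller would only acknowledge the transaction after durable_append returns, so a crash that leaves just B on disk happens at a point where A, B and C are all still considered unwritten.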
> Interestingly, postgres would be safer on linux if it used sync_file_range instead of fsync(), but that has other drawbacks and limitations.
I have thought about whether it would be possible to add a Linux-specific improvement here, in the code path that already does something custom in this area for Windows/Mac OS X when you use wal_sync_method=fsync_writethrough.
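For readers unfamiliar with what a "writethrough" sync looks like from user space, here is a rough Python sketch. It assumes the Mac OS X behaviour rests on fcntl() with F_FULLFSYNC, which asks the drive to empty its write cache; the helper name full_sync is hypothetical, and on platforms without that constant (Linux included) the sketch simply falls back to a plain fsync.

```python
import fcntl
import os


def full_sync(fd):
    """Sync fd and report which mechanism was used.

    Hypothetical helper for illustration: tries F_FULLFSYNC where the
    platform defines it, otherwise falls back to os.fsync.
    """
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            # Request a flush of the drive's own write cache as well.
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems reject the request; use plain fsync.
            pass
    os.fsync(fd)
    return "fsync"
```

On Linux this returns "fsync", which is exactly the gap under discussion: there is no equivalent per-file call here that is known to force the drive cache out.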
We really should update the documentation in this area before 8.4 ships. I'm looking into moving the "Tuning PostgreSQL WAL Synchronization" paper I wrote onto the wiki and then fleshing it out with all this filesystem-specific trivia.
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD