Re: POSIX file updates

Greg Smith <gsmith@xxxxxxxxxxxxx> · Wed, 2 Apr 2008 19:39:46 -0400 (EDT)

On Wed, 2 Apr 2008, James Mansion wrote:

But amusingly, PostgreSQL doesn't even support Solaris's direct I/O 
method right now unless you override the filesystem mounting options, 
so you end up needing to split it out and hack at that level 
regardless.
Indeed that's a shame. Why doesn't it use the directio?

You turn on direct I/O differently under Solaris than everywhere else, and 
nobody has bothered to write the patch (trivial) and the OS-specific code 
to turn it on only when appropriate (slightly trickier) to handle this 
case.  There's not a lot of pressure on PostgreSQL to handle this case 
correctly when Solaris admins are used to doing direct I/O tricks on 
filesystems already, so they don't complain about it much.
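
To make the platform difference concrete, here is a minimal sketch of how 
the request for direct I/O varies by OS.  The OS interfaces named in the 
strings are real; the table, the function name direct_io_method, and the 
platform prefixes are illustrative assumptions and are not taken from 
PostgreSQL's source.

```python
import sys

# How each platform asks for unbuffered ("direct") file I/O.
# The table and function are hypothetical, for illustration only.
DIRECT_IO_METHODS = {
    "linux": "open() with the O_DIRECT flag",
    "sunos": "directio(fd, DIRECTIO_ON) after open(), "
             "or the forcedirectio mount option",
    "darwin": "fcntl(fd, F_NOCACHE, 1) after open()",
}


def direct_io_method(platform=sys.platform):
    """Return a description of the direct I/O call for a sys.platform value."""
    for prefix, method in DIRECT_IO_METHODS.items():
        if platform.startswith(prefix):
            return method
    return "unknown; fall back to buffered I/O"


if __name__ == "__main__":
    print(direct_io_method())
```

The point of the sketch is only that Solaris needs a separate call (or a 
mount option) where other systems take a flag at open time, which is why 
it needs its own OS-specific code path.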

Yes but fsync and stable on disk isn't the same thing if there is a 
cache anywhere is it? Hence the fuss a while back about Apple's control 
of disk caches. Solaris and Windows do it too.

If your caches don't honor fsync by making sure it's on disk or a 
battery-backed cache, you can't use them and expect PostgreSQL to operate 
reliably.  Back to that "doesn't honor the contract" case.  The code that 
implements fsync_writethrough on both Windows and Mac OS handles those two 
cases by writing with the appropriate flags to not get cached in a harmful 
way.  I'm not aware of Solaris doing anything stupid here--the last two 
Solaris x64 systems I've tried that didn't have a real controller write 
cache ignored the drive cache and blocked at fsync just as expected, 
limiting commits to the RPM of the drive.  I've seen it on UFS and ZFS; 
both seem to do the right thing here.
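
A rough way to check whether fsync really blocks until the data is on the 
platter is to compare a measured commit rate against the rotation rate.  
The sketch below assumes the simplification that each serialized commit 
waits for one full rotation; the function names and the 512-byte record 
size are made up for illustration.

```python
import os
import tempfile
import time


def max_commits_per_second(rpm):
    """Upper bound on serialized commits if each fsync waits one rotation."""
    return rpm / 60.0


def measure_fsync_rate(directory, commits=50):
    """Append a small record and fsync it, `commits` times; return commits/s."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(commits):
            os.write(fd, b"x" * 512)
            os.fsync(fd)  # should block until the write is durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return commits / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    rate = measure_fsync_rate(tempfile.gettempdir())
    print("measured commits/s:", round(rate))
    print("bound for a 7200 RPM drive:", max_commits_per_second(7200))
```

If the measured rate on a single drive with no battery-backed controller 
cache comes out far above the bound, something between the OS and the 
platter is probably caching writes.  Note that the default temporary 
directory may be a RAM-backed filesystem, so point it at a directory on 
the drive you care about.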

Isn't allowing the OS to accumulate an arbitrary number of dirty blocks 
without control of the rate at which they spill to media just exposing a 
possibility of an IO storm when it comes to checkpoint time?  Does 
bgwriter attempt to control this with intermediate fsync (and push to 
media if available)?

It can cause exactly such a storm.  If you haven't seen my other paper 
yet, at http://www.westnet.com/~gsmith/content/linux-pdflush.htm, it goes 
over this exact issue as far as how Linux handles it.  Now that it's easy 
to get even a home machine to have 8GB of RAM in it, Linux will gladly 
buffer ~800MB worth of data for you and cause a serious storm at fsync 
time.  It's not pretty when that happens into a single SATA drive because 
there are typically plenty of seeks in that write storm too.
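
The arithmetic behind that figure can be sketched as follows.  The 10 
percent ratio is an assumption, chosen because it reproduces ~800MB from 
8GB; the actual thresholds are tunable and are what the linked paper 
covers.  The helper names are hypothetical, and the /proc path only 
exists on Linux.

```python
def dirty_bytes_allowed(total_ram_bytes, ratio_percent):
    """Bytes of dirty data allowed to accumulate at a given percentage."""
    return total_ram_bytes * ratio_percent // 100


def read_ratio(path="/proc/sys/vm/dirty_background_ratio"):
    """Return the kernel's current ratio as an int, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    ratio = read_ratio()
    if ratio is None:
        ratio = 10  # assumed value, see the note above
    ram = 8 * 10**9
    print(dirty_bytes_allowed(ram, ratio) // 10**6, "MB may sit dirty in RAM")
```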

There was a prototype implementation plan that wasn't followed completely 
through in 8.3 to spread fsyncs out a bit better to keep this from being 
as bad.  That optimization might make it into 8.4 but I don't know that 
anybody is working on it.  The spread checkpoints in 8.3 are so much 
better than 8.2 that many are happy to at least have that.
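
The general idea of spreading fsyncs out can be sketched like this.  It 
is not the 8.3 prototype and not PostgreSQL code; the function 
spread_fsyncs and its parameters are invented for illustration, and it 
assumes the files to sync are already known as open descriptors.

```python
import os
import time


def spread_fsyncs(fds, total_seconds, sleep=time.sleep):
    """fsync each descriptor in turn, pausing evenly between calls.

    Instead of issuing every fsync back to back, the calls are paced so
    that the whole batch takes roughly `total_seconds` of pause time.
    Returns the number of descriptors synced.
    """
    fds = list(fds)
    if not fds:
        return 0
    pause = total_seconds / len(fds)
    for i, fd in enumerate(fds):
        os.fsync(fd)
        if i < len(fds) - 1:  # no need to wait after the last one
            sleep(pause)
    return len(fds)
```

Pacing like this only helps if the OS has not already let a huge pile of 
dirty data build up behind the first fsync, which is the other half of 
the problem described above.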

It strikes me as odd that fsync_writethrough isn't the most preferred 
option where it is implemented.

It's only available on Win32 and Mac OS X (the OSes that might get it 
wrong without that nudge).  I believe every path through the code uses it 
by default on those platforms; there's a lot of remapping in there.

You can get an idea of what code was touched by looking at the patch that 
added the OS X version of fsync_writethrough (it was previously only 
Win32):  http://archives.postgresql.org/pgsql-patches/2005-05/msg00208.php
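
For a feel of what a write-through flush looks like from user space, 
here is a minimal sketch.  On Mac OS X, fcntl with F_FULLFSYNC asks the 
drive to flush its own cache, which a plain fsync there does not 
guarantee.  The function name flush_through is hypothetical, the Win32 
side is not shown, and this is not the code from the patch.

```python
import os
import sys

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None


def flush_through(fd):
    """Flush fd to stable storage; return the name of the method used."""
    full = getattr(fcntl, "F_FULLFSYNC", None) if fcntl else None
    if sys.platform == "darwin" and full is not None:
        fcntl.fcntl(fd, full)  # also asks the drive to empty its cache
        return "F_FULLFSYNC"
    os.fsync(fd)  # elsewhere, rely on fsync honoring the contract
    return "fsync"
```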

The postgres approach of *requiring* that there be no cache below the OS 
is problematic, especially since the battery backup on internal array 
controllers is hardly the handiest solution when you find the mobo has 
died.

If the battery backup cache doesn't survive being moved to another machine 
after a motherboard failure, it's not very good.  The real risk to be 
concerned about is what happens if the card itself dies.  If that happens, 
you can't help but lose transactions.

You seem to feel that there is an alternative here that PostgreSQL could 
take but doesn't.  There is not.  You either wait until writes hit disk, 
which by physical limitations only happens at RPM speed and therefore is 
too slow to commit for many cases, or you cache in the most reliable 
memory you've got and hope for the best.  No software approach can change 
any of that.

And especially when the inability to flush caches on modern SATA and SAS 
drives would appear to be more a failing in some operating systems than 
in the drives themselves.

I think you're extrapolating too much from the Win32/Apple cases here. 
There are plenty of cases where the so-called "lying" drives themselves 
are completely stupid on their own regardless of operating system.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)