Re: Maximum transaction rate

Greg Smith <gsmith@xxxxxxxxxxxxx> · Tue, 17 Mar 2009 17:43:21 -0400 (EDT)

On Tue, 17 Mar 2009, Marco Colombo wrote:

If LVM/dm is lying about fsync(), all this is moot. There's no point
talking about disk caches.

I decided to run some tests to see what's going on there, and it looks 
like some of my quick criticism of LVM might not actually be valid--it's 
only the performance that is problematic, not necessarily the reliability. 
Appears to support fsync just fine.  I tested with kernel 2.6.22, so 
certainly not before the recent changes to LVM behavior improving this 
area, but with the bugs around here from earlier kernels squashed (like 
crummy HPA support circa 2.6.18-2.6.19, see 
https://launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/82314 )

You can do a quick test of fsync rate using sysbench; got the idea from 
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
(their command has some typos, fixed one below)

If fsync is working properly, you'll get something near the RPM rate of 
the disk.  If it's lying, you'll see a much higher number.

I couldn't get the current sysbench-0.4.11 to compile (bunch of X 
complains from libtool), but the old 0.4.8 I had around still works fine. 
Let's start with a regular ext3 volume.  Here's what I see against a 7200 
RPM disk (=120 rotations/second) with the default caching turned on:

$ alias fsynctest="~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run | grep \"Requests/sec\""
$ fsynctest
 6469.36 Requests/sec executed

That's clearly lying as expected (and I ran all these a couple of times, 
just reporting one for brevity sake; snipped some other redundant stuff 
too).  I followed the suggestions at 
http://www.postgresql.org/docs/current/static/wal-reliability.html to turn 
off the cache and tested again:

$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
           *    Write cache
$ sudo /sbin/hdparm -W0 /dev/sdf

/dev/sdf:
 setting drive write-caching to 0 (off)
$ sudo /sbin/hdparm -I /dev/sdf | grep "Write cache"
                Write cache
$ fsynctest
  106.05 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
 6469.36 Requests/sec executed

Great:  I was expecting ~120 commits/sec from a 7200 RPM disk, that's what 
I get when caching is off.

Now, let's switch to using a LVM volume on a different partition of 
that disk, and run the same test to see if anything changes.

$ sudo mount /dev/lvmvol/lvmtest /mnt/
$ cd /mnt/test
$ fsynctest
 6502.67 Requests/sec executed
$ sudo /sbin/hdparm -W0 /dev/sdf
$ fsynctest
  112.78 Requests/sec executed
$ sudo /sbin/hdparm -W1 /dev/sdf
$ fsynctest
 6499.11 Requests/sec executed

Based on this test, it looks to me like fsync works fine on LVM.  It must 
be passing that down to the physical disk correctly or I'd still be seeing 
inflated rates.  If you've got a physical disk that lies about fsync, and 
you put a database on it, you're screwed whether or not you use LVM; 
nothing different on LVM than in the regular case.  A battery-backed 
caching controller should also handle fsync fine if it turns off the 
physical disk cache, which most of them do--and, again, you're no more or 
less exposed to that particular problem with LVM than a regular 
filesystem.

The thing that barriers helps out with is that it makes it possible to 
optimize flushing ext3 journal metadata when combined with hard drives 
that support the appropriate cache flushing mechanism (what hdparm calls 
"FLUSH CACHE EXT"; see 
http://forums.opensuse.org/archives/sls-archives/archives-suse-linux/archives-desktop-environments/379681-barrier-sync.html 
).  That way you can prioritize flushing just the metadata needed to 
prevent filesystem corruption while still fully caching less critical 
regular old writes.  In that situation, performance could be greatly 
improved over turning off caching altogether.  However, in the PostgreSQL 
case, the fsync hammer doesn't appreciate this optimization anyway--all 
the database writes are going to get forced out by that no matter what 
before the database considers them reliable.  Proper barriers support 
might be helpful in the case where you're using a database on a shared 
disk that has other files being written to as well, basically allowing 
caching on those while forcing the database blocks to physical disk, but 
that presumes the Linux fsync implementation is more sophisticated than I 
believe it currently is.

Far as I can tell, the main open question I didn't directly test here is 
whether LVM does any write reordering that can impact database use because 
it doesn't handle write barriers properly.  According to 
https://www.redhat.com/archives/linux-lvm/2009-March/msg00026.html it does 
not, and I never got the impression that was impacted by the LVM layer 
before.  The concern is nicely summarized by the comment from Xman at 
http://lwn.net/Articles/283161/ :

"fsync will block until the outstanding requests have been sync'd do disk, 
but it doesn't guarantee that subsequent I/O's to the same fd won't 
potentially also get completed, and potentially ahead of the I/O's 
submitted prior to the fsync. In fact it can't make such guarantees 
without functioning barriers."

Since we know LVM does not have functioning barriers, this would seem to 
be one area where PostgreSQL would be vulnerable.  But since ext3 doesn't 
have barriers turned by default either (except some recent SuSE system), 
it's not unique to a LVM setup, and if this were really a problem it would 
be nailing people everywhere.  I believe the WAL design handles this 
situation.

There are some known limitations to Linux fsync that I remain somewhat 
concerned about, independantly of LVM, like "ext3 fsync() only does a 
journal commit when the inode has changed" (see 
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ).  The 
way files are preallocated, the PostgreSQL WAL is supposed to function 
just fine even if you're using fdatasync after WAL writes, which also 
wouldn't touch the journal (last time I checked fdatasync was implemented 
as a full fsync on Linux).  Since the new ext4 is more aggressive at 
delaying writes than ext3, it will be interesting to see if that uncovers 
some subtle race conditions here that have been lying dormant so far.

I leave it as an exercise to the dedicated reader to modify the sysbench 
test to use O_SYNC/O_DIRECT in order to re-test LVM for the situation if 
you changed wal_sync_method=open_sync , how to do that is mentioned 
briefly at http://sysbench.sourceforge.net/docs/

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general