Markus Wanner wrote:
> Hi,
>
> Martijn van Oosterhout wrote:
>> And fsync better do what you're asking
>> (how fast is just a performance issue, just as long as it's done).
>
> Where are we on this issue? I've read all of this thread and the one on
> the linux-lvm mailing list as well, but still don't feel confident.
>
> In the following scenario:
>
> fsync -> filesystem -> physical disk
>
> I'm assuming the filesystem correctly issues a blkdev_issue_flush() on
> the physical disk upon fsync(), to do what it's told: flush the cache(s)
> to disk. Further, I'm also assuming the physical disk is flushable (i.e.
> it correctly implements the blkdev_issue_flush() call). Here we can be
> pretty certain that fsync works as advertised, I think.
>
> The unanswered question to me is: what happens if I add LVM in
> between, as follows:
>
> fsync -> filesystem -> device mapper (lvm) -> physical disk(s)
>
> Again, assume the filesystem issues a blkdev_issue_flush() to the lower
> layer and the physical disks are all flushable (and implement that
> correctly). How does the device mapper behave?
>
> I'd expect it to forward the blkdev_issue_flush() call to all affected
> devices and only return after the last one has confirmed and completed
> flushing its caches. Is that the case?
>
> I've also read about the newish write barriers and about filesystems
> implementing fsync with such write barriers. That seems fishy to me and
> would of course break in combination with LVM (which doesn't completely
> support write barriers, AFAIU). However, that's clearly the filesystem
> side of the story and has not much to do with whether fsync lies on top
> of LVM or not.
>
> Help in clarifying this issue greatly appreciated.
>
> Kind Regards
>
> Markus Wanner

Well, AFAIK, the summary would be:

1) adding LVM to the chain makes no difference;

2) you still need to disable the write-back cache on IDE/SATA disks for
fsync() to work properly;

3) without LVM and with the write-back cache enabled, due to current(?)
limitations in the Linux kernel, you may be less vulnerable with some
journaled filesystems (though not ext3 in data=writeback or data=ordered
mode; I'm not sure about data=journal) if you use fsync() (or O_SYNC).
"Less vulnerable" means that all pending changes except the very last one
are committed to disk. (A minimal write-and-fsync sketch is appended at
the end of this message, for reference.)

So:

- write-back cache + ext3 = unsafe
- write-back cache + other fs = (depending on the fs) [*] safer, but not 100% safe
- write-back cache + LVM + any fs = unsafe
- write-through cache + any fs = safe
- write-through cache + LVM + any fs = safe

[*] the fs must use (directly, or indirectly via a journal commit) a write
barrier on fsync(). Ext3 doesn't (it does when the inode changes, but that
happens only once a second).

If you want both speed and safety, use a battery-backed controller (and
write-through cache on the disks, but the controller should enforce that
when you plug the disks in). It's the usual "Fast, Safe, Cheap: choose two".

This is an interesting article:

http://support.microsoft.com/kb/234656/en-us/

Note how for all three kinds of disk (IDE/SATA/SCSI) they say: "Disk
caching should be disabled in order to use the drive with SQL Server".
They don't mention write barriers.

.TM.
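
Appended for reference: a minimal, illustrative sketch of the
write-and-fsync pattern the discussion above hinges on. This is my own
toy example, not code from PostgreSQL or the kernel, and the path
"testfile" is just a placeholder. The point is that even when fsync()
returns 0, the data is only truly durable if every layer underneath
(filesystem, device mapper, disk write cache) honors the flush.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "data that must survive a power failure\n";

    /* "testfile" is only a placeholder path for this sketch. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
        perror("write");
        close(fd);
        return EXIT_FAILURE;
    }

    /*
     * Ask the kernel to push the file's data and metadata to stable
     * storage.  A return value of 0 only means the layers below claimed
     * success: with a write-back disk cache and no working flush/barrier
     * path, the blocks may still sit in the drive's cache when power is
     * lost.
     */
    if (fsync(fd) != 0) {
        perror("fsync");
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}

On a drive with its write-back cache enabled (and no barrier support in
the stack), this program can report success and still lose the data on
power failure; with the cache disabled, or behind a battery-backed
controller, the same fsync() call can be trusted.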