Eric Sandeen wrote:
> > If we were seeing a significant number of "hey, my disk got wrecked"
> > reports which attributable to this then yes, perhaps we should change
> > the default. But I've never seen _any_, although I've seen claims that
> > others have seen reports.
>
> Hm, how would we know, really? What does it look like? It'd totally
> depend on what got lost... When do you find out? Again depends what
> you're doing, I think. I'll admit that I don't have any good evidence
> of my own. I'll go off and do some plug-pull-testing and a benchmark or
> two.

You have to pull the plug quite a lot, while there is data in the
write cache, and when the data is something you will notice later.
Checking a filesystem is hard. Something systematic would be good,
for which you will want an electronically controlled power switch.

I have seen corruption which I believe is from lack of barriers, and
which hasn't occurred since I implemented them (or disabled the write
cache). But it's hard to be sure that was the real cause.

If you just want to test the block I/O layer and the drive itself,
don't use the filesystem, but write a program which just accesses the
block device, continuously writing with/without barriers every so
often, and after a power cycle reads back to see what was and wasn't
written.

I think there may be drives which won't show any effect, if they have
enough internal power (from platter inertia) to write everything in
the cache before losing it.

If you want to test, the worst case is to queue many small writes at
seek positions across the disk, so that flushing the disk's write
cache takes the longest time. A good pattern might be to take the
numbers 0..2^N-1 (e.g. 0..255), reverse the bit order of each number
(0, 128, 64, 192...) and do writes at those block positions, scaled up
to the range of the whole disk. The idea is that if the disk just
caches the last few writes queued, they will always be quite spread
out.
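
To make the bit-reversed pattern concrete, here is a rough sketch in
Python. It only generates the block positions; it does no I/O. The
names bit_reverse and write_positions, and the total_blocks parameter,
are made up for illustration, and the scaling to the disk size is just
one simple way of doing it.

```python
def bit_reverse(value, bits):
    """Return value with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def write_positions(total_blocks, bits=8):
    """Yield 2**bits block numbers in bit-reversed order, scaled to the disk.

    total_blocks is an assumed parameter: the number of blocks on the
    device under test. Consecutive positions end up far apart.
    """
    count = 1 << bits
    for i in range(count):
        yield bit_reverse(i, bits) * total_blocks // count


if __name__ == "__main__":
    # First few 8-bit values, matching the sequence given above.
    print([bit_reverse(i, 8) for i in range(4)])
    # Positions for a hypothetical 1,000,000-block device, N = 3.
    print(list(write_positions(1_000_000, bits=3)))
```

A test program would write a small, recognisable record at each of
these positions in turn, and after the power cycle read them back to
see which ones made it to the platter.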
However, a particular disk's cache algorithm may bound the write cache
size by the predicted seek time needed to flush it, rather than by
bytes, defeating that.

> But, drive caches are only getting bigger, I assume this can't help. I
> have a hard time seeing how speed at the cost of correctness is the
> right call...

I agree.

The MacOS X folks decided that speed is most important for fsync().
fsync() does not guarantee commit to platter. *But* they added an
fcntl() for applications to request a commit to platter, which SQLite
at least uses. I don't know if MacOS X uses barriers for filesystem
operations.

> > Do we know which distros are enabling barriers by default?
>
> SuSE does (via patch for ext3).

SuSE did the same for 2.4: they patched the kernel to add barriers to
2.4. (I'm using an improved version of that patch in my embedded
devices.) It's nice they haven't weakened the behavior in 2.6.

> Red Hat & Fedora don't,

Neither did they patch for barriers in 2.4, but we can't hold that
against them :-)

> and install by default on lvm which won't pass barriers anyway.

Considering how many things depend on LVM, the fact that it doesn't
pass barriers is scary. People use software RAID assuming integrity.
RAID arrays are immune to many hardware faults. But it turns out, on
Linux, that a single disk can have higher integrity against power
failure than a RAID.

> So maybe it's hypocritical to send this patch from redhat.com :)

So send the patch to fix LVM too :-)

> And as another "who uses barriers" datapoint, reiserfs & xfs both have
> them on by default.

Are they noticeably slower than ext3? If not, it suggests ext3 can be
fixed to keep its performance with barriers enabled.

Specifically: under some workloads, batching larger changes into the
journal between commit blocks might compensate. Maybe the journal has
been tuned for barriers off because they are off by default?

> I suppose alternately I could send another patch to remove "remember
> that ext3/4 by default offers higher data integrity guarantees than
> most." from Documentation/filesystems/ext4.txt ;)

It would be fair. I suspect a fair number of people are under the
impression ext3 uses barriers with no special options, prior to this
thread. It was advertised as a feature in development during the 2.5
series.