Hi,

On Mon, 2008-03-31 at 15:16 +0200, Mathieu Avila wrote:
> On Mon, 31 Mar 2008 11:54:20 +0100,
> Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
>
> > Hi,
> >
> > Both GFS1 and GFS2 are safe from this problem since neither of them
> > uses barriers. Instead we do a flush at the critical points to
> > ensure that all data is on disk before proceeding with the next
> > stage.
> >
> I don't think this solves the problem.
>
> Consider a cheap iSCSI disk (no NVRAM, no UPS) accessed by all my GFS
> nodes; this disk has a write cache enabled, which means it will reply
> that write requests have completed even if they have not really been
> written to the platters. The disk (like most disks nowadays) has some
> logic that allows it to optimize writes by rescheduling them. It is
> possible that all writes are ACK'd before the power failure, but only
> a fraction of them were really performed: some from before the flush,
> some from after it. Not all blocks written before the flush made it
> to disk, yet some blocks written after the flush did -> the FS is
> corrupted.
>
> So, after the power failure, all data in the disk's write cache is
> forgotten. If the journal data was in the disk cache, the journal was
> not written to disk, but other metadata has been written, so there
> are metadata inconsistencies.
>
I don't agree that write caching implies that I/O must be ACKed before
it has hit disk. It might well be reordered (which is ok), but if we
wait for all outstanding I/O completions, then we ought to be able to
be sure that all I/O is actually on disk, or at the very least that
further I/O will not be reordered with already ACKed data. If devices
are sending ACKs in advance of the I/O hitting disk, then I think
that's broken behaviour.

Consider what happens if a device were to send an ACK for a write and
then discover an uncorrectable error during the write - how would it
then be able to report it, since it had already sent an "ok"? So far
as I can see, the only reason for having the drive send an I/O
completion back is to report the success or otherwise of the
operation, and if that operation hasn't been completed, then we might
just as well not wait for ACKs at all.

> This is the problem that I/O barriers try to solve, by forcing the
> block device (and the block layer) to write all blocks issued before
> the barrier before any block issued after the barrier begins being
> written.
>
> The other solution is to completely disable the write cache of the
> disks, but this leads to dramatically bad performance.
>
If it's a choice between poor performance that's correct and good
performance which might lose data, then I know which I would choose
:-)

Not all devices support barriers, so it always has to be an option;
ext3 uses the barrier=1 mount option for this reason, and if it fails
(e.g. if the underlying device doesn't support barriers) it falls back
to the same technique which we are using in gfs1/2.

The other thing to bear in mind is that barriers, as currently
implemented, are not really that great either. It would be nice to
replace them with something that allows better performance with (for
example) mirrors, where the only current method of implementing the
barrier is to wait for all the I/O completions from all the disks in
the mirror set (and thus we are back to waiting for outstanding I/O
again).

Steve.
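
P.S. To make the flush-and-wait ordering concrete, here is a minimal
user-space sketch (this is not the actual GFS code - the filename and
record layout are made up for illustration, and it assumes fdatasync()
only returns once the data has genuinely reached stable storage, which
is exactly the property under debate above):

    /* Stage 1: write the journal record, then wait at the critical
     * point for the device to acknowledge it.  Only then is it safe
     * to write the metadata the record describes (stage 2). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void write_all(int fd, const void *buf, size_t len, off_t off)
    {
            if (pwrite(fd, buf, len, off) != (ssize_t)len) {
                    perror("pwrite");
                    exit(1);
            }
    }

    int main(void)
    {
            char journal[512]  = "journal record describing the change";
            char metadata[512] = "the in-place metadata change itself";
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            write_all(fd, journal, sizeof(journal), 0);

            /* The critical point: wait for completion.  If the disk
             * ACKs data that is still sitting in its volatile cache,
             * this guarantee evaporates. */
            if (fdatasync(fd) < 0) {
                    perror("fdatasync");
                    return 1;
            }

            /* Stage 2: the journal record is durable, so the metadata
             * can now be written in place. */
            write_all(fd, metadata, sizeof(metadata), 4096);
            close(fd);
            return 0;
    }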
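
The mirror case above amounts to the same bookkeeping: count the
writes in flight and let the "barrier" drain that count to zero before
anything behind it is issued. A rough sketch of that pattern follows
(hypothetical helper names, nothing like the real md/dm code):

    /* Track writes that have been issued but not yet ACKed; the
     * barrier simply waits until every mirror leg has completed. */
    #include <pthread.h>

    struct inflight {
            pthread_mutex_t lock;
            pthread_cond_t  idle;
            unsigned int    count;  /* issued but not yet ACKed */
    };

    static void inflight_init(struct inflight *f)
    {
            pthread_mutex_init(&f->lock, NULL);
            pthread_cond_init(&f->idle, NULL);
            f->count = 0;
    }

    static void io_submitted(struct inflight *f)
    {
            pthread_mutex_lock(&f->lock);
            f->count++;
            pthread_mutex_unlock(&f->lock);
    }

    /* Called from each disk's completion handler. */
    static void io_completed(struct inflight *f)
    {
            pthread_mutex_lock(&f->lock);
            if (--f->count == 0)
                    pthread_cond_broadcast(&f->idle);
            pthread_mutex_unlock(&f->lock);
    }

    /* The "barrier": block until all outstanding I/O has been ACKed,
     * which is why it costs us a full drain of the mirror set. */
    static void barrier_wait(struct inflight *f)
    {
            pthread_mutex_lock(&f->lock);
            while (f->count != 0)
                    pthread_cond_wait(&f->idle, &f->lock);
            pthread_mutex_unlock(&f->lock);
    }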
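
For reference, the ext3 behaviour mentioned above is selected with
"mount -o barrier=1", and on most ATA disks the write cache can be
turned off with "hdparm -W 0 /dev/sda" (with the performance cost
Mathieu describes).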