> -----Original Message-----
> From: Mike Hardy [mailto:mhardy@xxxxxxx]
> Sent: Friday, November 18, 2005 11:57 PM
> To: Guy
> Cc: 'Dan Stromberg'; 'Jure Pečar'; linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: raid5 write performance
>
> Guy wrote:
>
> > It is not just a parity issue. If you have a 4 disk RAID 5, you can't be
> > sure which if any have written the stripe. Maybe the parity was updated,
> > but nothing else. Maybe the parity and 2 data disks, leaving 1 data disk
> > with old data.
> >
> > Beyond that, md does write caching. I don't think the file system can tell
> > when a write is truly complete. I don't recall ever having a Linux system
> > crash, so I am not worried. But power failures cause the same risk, or
> > maybe more. I have seen power failures, even with a UPS!
>
> Good points there Guy - I do like your example. I'll go further with
> crashing too and say that I actually crash outright occasionally.
> Usually when building out new machines where I don't know the proper
> driver tweaks, or failing hardware, but it happens without power loss.
> It's important to get this correct and well understood.
>
> That said, unless I hear otherwise from someone that works in the code,
> I think md won't report the write as complete to upper layers until it
> actually is. I don't believe it does write-caching, and regardless, if
> it does it must not do it until some durable representation of the data
> is committed to hardware and the parity stays dirty until redundancy is
> committed.
>
> Building on that, barring hardware write-caching, I think with a
> journalling FS like ext3 and md only reporting the write complete when
> it really is, things won't be trusted at the FS level unless they're
> durably written to hardware.
>
> I think that's sufficient to prove consistency across crashes.
>
> For example, even if you crash during an update to a file smaller than a
> stripe, the stripe will be dirty so the bad parity will be discarded and
> the filesystem won't trust the blocks that didn't get reported back as
> written by md. So that file update is lost, but the FS is consistent and
> all the data it can reach is consistent with what it thinks is there.
>
> So, I continue to believe silent corruption is mythical. I'm still open
> to a good explanation that it's not, though.
>
> -Mike

I will take a stab at an explanation.

Assume a single stripe has data for 2 different files (A and B). A disk has
failed, say the one holding file B's block, so the array is degraded. The
file system writes a 4K chunk of data to file A. The parity gets updated but
not the data, or the data gets updated but not the parity. Then the system
crashes or power fails.

The system recovers, but md can't do anything about the parity with a failed
disk. The filesystem does its thing. The disk is replaced and added, and the
missing data block is reconstructed using a good block and a bad block. The
parity now matches the data, but the reconstructed block (file B) is wrong,
and the file using that block has not been changed for years. So, silent
corruption. But you could argue it was a double failure.

Also, after a power failure, I have seen the system come back with a single
disk failed. I guess that 1 disk had an expired superblock. When you add that
disk back, any stripes that were not up to date will be reconstructed with
invalid data. I don't know if the intent logging will help here or not.

Most likely, more than 1 disk will have an expired superblock. If you force
md to assemble the array it resyncs, but I don't know if it recomputes the
parity or picks a disk to pretend was just added. It shows a single disk as
rebuilding; if that is correct, silent corruption could occur.

Guy
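
To make the degraded-stripe case concrete, here is a minimal sketch in Python. It is not md code. It assumes a hypothetical 4-disk stripe (three data blocks plus one XOR parity block), made-up 4-byte block contents, and the variant where the data block for file A reaches the disk but the parity update does not. The names xor_blocks and simulate_write_hole are invented for this illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte, as RAID 5 parity does."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def simulate_write_hole() -> dict:
    # Hypothetical stripe contents (illustrative values only).
    a_old = b"AAAA"    # block belonging to file A
    b_orig = b"BBBB"   # block belonging to file B, untouched for years
    c = b"CCCC"        # third data block
    parity = xor_blocks(a_old, b_orig, c)

    # The disk holding file B's block fails. While the stripe is still
    # consistent, B can be recovered from the surviving blocks.
    assert xor_blocks(a_old, c, parity) == b_orig

    # File A is rewritten. Assumption: the new data block hits the disk,
    # then the machine crashes before the parity block is updated.
    a_new = b"aaaa"
    stale_parity = parity

    # The failed disk is replaced and B is rebuilt from a good block (c),
    # the new A block, and the stale parity.
    b_rebuilt = xor_blocks(a_new, c, stale_parity)

    # After the rebuild the stripe is self-consistent again, so a parity
    # check has nothing to complain about.
    parity_ok = xor_blocks(a_new, b_rebuilt, c) == stale_parity

    return {
        "b_orig": b_orig,
        "b_rebuilt": b_rebuilt,
        "parity_ok": parity_ok,
        "b_corrupted": b_rebuilt != b_orig,
    }


if __name__ == "__main__":
    print(simulate_write_hole())
```

In this sketch the rebuilt block for file B differs from the original while the parity check still passes, which is the silent corruption described above. Whether md can actually end up in this state depends on its write ordering and on any intent logging, which the sketch does not model.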