On Mon, Aug 24, 2009 at 5:31 AM, Pavel Machek <pavel@xxxxxx> wrote:
>
> Running journaling filesystem such as ext3 over flashdisk or degraded
> RAID array is a bad idea: journaling guarantees no longer apply and
> you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <pavel@xxxxxx>
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..80fa886
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,52 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly.
> +
> +	Fortunately writes failing are very uncommon on traditional
> +	spinning disks, as they have spare sectors they use when write
> +	fails.
> +
> +Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> +	An inherent problem with using flash as a normal block device
> +	is that the flash erase size is bigger than most filesystem
> +	sector sizes. So when you request a write, it may erase and
> +	rewrite some 64k, 128k, or even a couple megabytes on the
> +	really _big_ ones.
> +
> +	If you lose power in the middle of that, filesystem won't
> +	notice that data in the "sectors" _around_ the one you were
> +	trying to write to got trashed.
> +
> +	RAID-4/5/6 in degraded mode has same problem.
> +
> +
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +	Because RAM tends to fail faster than rest of system during
> +	powerfail, special hw killing DMA transfers may be necessary;
> +	otherwise, disks may write garbage during powerfail.
> +	This may be quite common on generic PC machines.
> +
> +	Note that atomic write is very hard to guarantee for RAID-4/5/6,
> +	because it needs to write both changed data, and parity, to
> +	different disks. (But it will only really show up in degraded mode).
> +	UPS for RAID array should help.

Can someone clarify whether this is true for RAID-6 with just a single
disk failure? I don't see why it would be.

And if not, can the above text be changed to reflect that RAID-4/5 with
a single disk failure and RAID-6 with a double disk failure are the
modes that have atomicity problems?

Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
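A minimal sketch of the degraded-mode effect described in the quoted NO-COLLATERALS and ATOMIC-WRITES text, assuming a plain RAID-4/5 stripe with XOR parity: three data chunks plus parity, one failed disk, and a write to a *different* disk whose data lands but whose parity update does not. The chunk size, disk count, and helper names (`parity`, `reconstruct`, `torn_write_demo`) are hypothetical. This models only the parity arithmetic, not md's actual write path, and it does not model RAID-6's second (Q) syndrome, so it does not by itself answer the RAID-6 question above.

```python
from functools import reduce

CHUNK = 8  # hypothetical chunk size in bytes, kept tiny for readability


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(chunks: list[bytes]) -> bytes:
    """RAID-4/5 style parity: XOR of every data chunk in the stripe."""
    return reduce(xor_bytes, chunks, bytes(CHUNK))


def reconstruct(chunks: list[bytes], par: bytes, missing: int) -> bytes:
    """Rebuild a failed disk's chunk from parity and the surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != missing]
    return reduce(xor_bytes, survivors, par)


def torn_write_demo(failed: int = 2) -> tuple[bytes, bytes]:
    """Return (original, rebuilt) contents of the failed disk's chunk after
    a torn update to a different disk in the same stripe."""
    data = [bytes([ord("A") + d]) * CHUNK for d in range(3)]  # 3 data disks
    par = parity(data)  # stripe is consistent at this point
    original = data[failed]

    # Overwrite disk 0's chunk. Power fails after the data write lands
    # but before the matching parity update reaches the parity disk.
    data[0] = b"Z" * CHUNK
    # par = parity(data)  # never happens: this is the torn write

    return original, reconstruct(data, par, failed)


if __name__ == "__main__":
    original, rebuilt = torn_write_demo()
    print(original, rebuilt)  # prints b'CCCCCCCC' b'XXXXXXXX'
```

The chunk on the failed disk was never written, yet it is rebuilt wrong, which is the "collateral damage" the patch attributes to degraded RAID-4/5/6. With all disks present, the same stale parity would go unnoticed until a later reconstruction needed it, consistent with the patch's remark that the problem mainly shows up in degraded mode.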