Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted.  In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
>
> I've seen several RAID-5 systems, and they all went to great lengths to
> ensure that interrupting a write to Sector A can't destroy Sector B.  It
> isn't easy; it involves journalling.  But I've always taken it as an
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered
> by a single parity sector and the RAID system restarts between when it has
> written Sector 1 and when it has written the new parity.  Now if you lose
> Sector 2, you'll recover incorrect contents for it.
>
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling.  If individual sector
> > writes are atomic, this isn't an issue.  Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
>
> It would make a lot more sense to make the RAID block device driver
> present a block device that can't corrupt data upon something as simple as
> a restart in the middle of a write to an unrelated sector than to make
> filesystem drivers comprehend a block device that can.  Less work, more
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do
> some of the fancy storage things we'd like to do these days anyway, so
> another approach would be to integrate the RAID-5 function in the
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess.  This is why RAID ought to work better in
the filesystem.  Two layers of journalling or equivalent does not
sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it.  I'm making assumptions.  Other
parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic
writes.

> >If individual sector writes are atomic, this isn't an issue.
>
> True, however: atomic is sufficient, but not necessary.  In the real
> world, disk drive writes aren't atomic, and it's OK.  A journalling
> filesystem can deal with a failed write wiping out the previous contents
> of the subject sector.  It just can't deal with a failed write polluting
> some unrelated previously hardened sector.

That's right.  But a failed write might corrupt previously hardened
sectors in these cases:

   - Disks with 4k sectors pretending to be 512 byte sectors.
   - RAIDs without journalling (or other equivalent) and no battery backup.
   - SSDs and other flash storage if their internal algorithms are stupid.

I've just noticed that a system crash is not the only way this type of
corruption can happen.

Does this argue for an additional parameter in the block device
hints, in addition to strip sizes and sector size: Radius of
Corruption on Failed Write?

For hard disks, this is the sector size.  But for RAIDs and maybe some
flash storage, it might be larger.

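The stale-parity scenario quoted above (Sectors 1-5 covered by one
parity sector, a restart after Sector 1 is written but before the new
parity is, then Sector 2 is lost) can be sketched as a toy model.  This
is only an illustration under stated assumptions: plain XOR parity, an
8-byte "sector", and made-up helper names (xor, parity, reconstruct).
It says nothing about what Linux md or any particular RAID
implementation actually does.

```python
from functools import reduce

SECTOR = 8  # toy sector size in bytes (assumption; real sectors are 512 or 4096)


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(sectors):
    """XOR parity over all data sectors in the stripe."""
    return reduce(xor, sectors)


def reconstruct(stripe, par, lost):
    """Rebuild the sector at index `lost` from the survivors plus parity."""
    survivors = [s for i, s in enumerate(stripe) if i != lost]
    return reduce(xor, survivors, par)


# Five data sectors ("Sectors 1-5") covered by a single parity sector.
stripe = [bytes([i]) * SECTOR for i in range(1, 6)]
par = parity(stripe)
old_sector_2 = stripe[1]

# While data and parity agree, losing Sector 2 is recoverable.
assert reconstruct(stripe, par, lost=1) == old_sector_2

# Interrupted write: the new Sector 1 reaches its disk, but the system
# restarts before the parity sector is rewritten, so `par` is now stale.
stripe[0] = b"\xff" * SECTOR

# Later the disk holding Sector 2 fails.  Sector 2 was never written to,
# yet rebuilding it from stale parity yields the wrong contents.
recovered = reconstruct(stripe, par, lost=1)
write_hole_corrupts = recovered != old_sector_2
print("Sector 2 recovered correctly:", not write_hole_corrupts)
```

In this model the damage lands on a sector the interrupted write never
addressed, which is the sense in which the "radius of corruption" is
larger than one sector and not contiguous.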
-- Jamie