Re: ordered I/O with multipath

Bryan Henderson <hbryan@xxxxxxxxxx> · Thu, 9 Apr 2009 11:32:43 -0700

> If the RAID code is changed to handle barriers, that would still have
> possible "scattershot" corruption on RAID-5, because writing a single
> sector on the logical device affects more than one visible sector if
> it is interrupted.  In other words, the "radius of corruption" is
> bigger than one sector for RAID-5, and it's not contiguous either.

I've seen several RAID-5 systems, and they all went to great lengths to 
ensure that interrupting a write to Sector A can't destroy Sector B.  It 
isn't easy; it involves journalling.  But I've always taken it as an 
absolute requirement.

I assume you're talking about a case where Sectors 1-5 are covered by a 
single parity sector and the RAID system restarts after it has written 
Sector 1 but before it has written the new parity.  Now if you lose 
Sector 2, you'll recover incorrect contents for it.
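
To make that scenario concrete, here is a minimal Python sketch.  It 
assumes a toy stripe of five 4-byte "sectors" protected by one XOR parity 
sector; the helper xor_all and the variable names are made up for 
illustration and do not come from any real RAID implementation.

```python
from functools import reduce

def xor_all(sectors):
    """XOR equal-length byte strings together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

# Toy stripe: Sectors 1-5 (list indexes 0-4) covered by one parity sector.
data = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_all(data)
original_sector2 = data[1]

# The new contents of Sector 1 reach the disk...
data[0] = b"\xaa" * 4
# ...but the system restarts before the matching parity is written,
# so `parity` still describes the old Sector 1.

# Later the device holding Sector 2 is lost; rebuild it from the
# surviving data sectors plus the (stale) parity.
survivors = [s for i, s in enumerate(data) if i != 1]
rebuilt_sector2 = xor_all(survivors + [parity])

print(rebuilt_sector2 == original_sector2)  # False
```

Sector 2 was never written, yet its reconstructed contents are wrong, 
which is the "unrelated sector" damage in question.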

Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're 
saying it does have this problem.

> In principle, journalling filesystems need to know the "radius of
> corruption" to provide robust journalling.  If individual sector
> writes are atomic, this isn't an issue.  Some people think sector
> writes are atomic on modern hard drives (but I wouldn't count on it).
> But it is definitely not atomic when writing to a RAID or multipath if
> the write affects more than one device.

It would make a lot more sense to make the RAID block device driver 
present a block device that can't corrupt data upon something as simple as 
a restart in the middle of a write to an unrelated sector than to make 
filesystem drivers comprehend a block device that can.  Less work, more 
integrity.
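
As a rough illustration of how a RAID layer might hide the problem using 
the kind of journalling mentioned earlier, here is a hedged Python sketch.  
The JournaledStripe class, its one-entry "intent journal", and the 
crash_after_data flag are all hypothetical; this is not how any particular 
product or the Linux kernel does it, just the general idea under those 
assumptions.

```python
from functools import reduce

def xor_all(sectors):
    """XOR equal-length byte strings together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sectors))

class JournaledStripe:
    """Toy stripe: data sectors, XOR parity, and a one-entry intent journal."""

    def __init__(self, data):
        self.data = list(data)
        self.parity = xor_all(self.data)
        self.journal = None  # stands in for a record on stable storage

    def write(self, index, new_data, crash_after_data=False):
        new_parity = xor_all(
            [new_data if i == index else s for i, s in enumerate(self.data)])
        # 1. Harden the intent (new data and new parity) before touching
        #    the stripe itself.
        self.journal = (index, new_data, new_parity)
        # 2. Update the data sector in place.
        self.data[index] = new_data
        if crash_after_data:
            return  # simulated restart before the parity write
        # 3. Update parity, then retire the journal entry.
        self.parity = new_parity
        self.journal = None

    def recover(self):
        """Replay a pending journal entry after a restart (idempotent)."""
        if self.journal is not None:
            index, new_data, new_parity = self.journal
            self.data[index] = new_data
            self.parity = new_parity
            self.journal = None

    def rebuild(self, lost):
        """Reconstruct one lost data sector from the others plus parity."""
        survivors = [s for i, s in enumerate(self.data) if i != lost]
        return xor_all(survivors + [self.parity])

stripe = JournaledStripe([bytes([i] * 4) for i in range(1, 6)])
stripe.write(0, b"\xaa" * 4, crash_after_data=True)
stripe.recover()
print(stripe.rebuild(1) == bytes([2] * 4))  # True
```

With the replay step in place, an interrupted write to one sector no 
longer changes what a rebuild returns for a different sector.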

Some have noted recently that block devices are really too simple to do 
some of the fancy storage things we'd like to do these days anyway, so 
another approach would be to integrate the RAID-5 function in the 
filesystem driver instead of attempting to have a RAID block device layer.

For now, I'll just try to remember not to use Linux kernel RAID-5.

> If individual sector writes are atomic, this isn't an issue.

True, but atomicity is sufficient, not necessary.  In the real 
world, disk drive writes aren't atomic, and it's OK.  A journalling 
filesystem can deal with a failed write wiping out the previous contents 
of the subject sector.  It just can't deal with a failed write polluting 
some unrelated previously hardened sector.
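
A small Python sketch of that distinction, under simplifying assumptions: 
the "disk" is a dict of sector contents, the journal is a list of 
checksummed records, and the names record and replay are invented for 
this example rather than taken from any real filesystem.

```python
import zlib

def record(sector_no, payload):
    """A journal record: target sector, payload, and a CRC of the payload."""
    return (sector_no, payload, zlib.crc32(payload))

def replay(disk, journal):
    """Rewrite every sector named by an intact journal record."""
    for sector_no, payload, crc in journal:
        if zlib.crc32(payload) == crc:
            disk[sector_no] = payload

disk = {7: b"old-7", 8: b"old-8"}
journal = [record(7, b"new-7")]  # hardened before the in-place write

# Case 1: the in-place write to sector 7 fails and leaves garbage there.
disk[7] = b"\x00garbage"
replay(disk, journal)
print(disk[7])  # b'new-7'

# Case 2: the failed write also pollutes sector 8, which no journal
# record covers, so replay has nothing with which to repair it.
disk[8] = b"\x00garbage"
replay(disk, journal)
print(disk[8])  # b'\x00garbage'
```

Replay repairs the subject sector because the journal names it; it cannot 
repair a sector the filesystem had no reason to think was at risk.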

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems
