Re: ordered I/O with multipath

Jamie Lokier <jamie@xxxxxxxxxxxxx> · Thu, 9 Apr 2009 21:00:15 +0100

Bryan Henderson wrote:
> > If the RAID code is changed to handle barriers, that would still have
> > possible "scattershot" corruption on RAID-5, because writing a single
> > sector on the logical device affects more than one visible sector if
> > it is interrupted.  In other words, the "radius of corruption" is
> > bigger than one sector for RAID-5, and it's not contiguous either.
> 
> I've seen several RAID-5 systems, and they all went to great lengths to 
> ensure that interrupting a write to Sector A can't destroy Sector B.  It 
> isn't easy; it involves journalling.  But I've always taken it as an 
> absolute requirement.

How do you do a second layer of journalling (in addition to the
filesystem's) without a big performance penalty for the extra seeks?

> I assume you're talking about something like where Sectors 1-5 are covered 
> by a single parity sector and the RAID system restarts between when it has
> written Sector 1 and when it has written the new parity.  Now if you lose
> Sector 2, you'll recover incorrect contents for it.
> 
> Linux kernel RAID-5 isn't one of the ones I've looked at; I presume you're 
> saying it does have this problem.

No, I'm assuming it has this problem because every description of
RAID-5 I've seen does not mention journalling or anything equivalent.
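
To make the scenario concrete, here is a minimal sketch of the failure
described above.  It assumes plain XOR parity over one stripe with no
journal and no battery-backed cache.  The function names and the tiny
4-byte "sectors" are invented for illustration; this says nothing about
how Linux md is actually implemented.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(data, parity, lost):
    """Rebuild the block at index `lost` from the survivors plus parity."""
    survivors = [b for i, b in enumerate(data) if i != lost]
    return xor_blocks(survivors + [parity])

# One stripe: five data sectors covered by a single parity sector.
data = [bytes([i] * 4) for i in range(1, 6)]
parity = xor_blocks(data)
original_sector2 = data[1]

# With consistent parity, a lost sector is recovered correctly.
assert reconstruct(data, parity, 1) == original_sector2

# Crash window: the new contents of sector 1 reach the disk,
# but the matching parity update does not.
data[0] = b"\xff" * 4
stale_parity = parity

# Later the disk holding sector 2 fails and is rebuilt from stale parity.
recovered = reconstruct(data, stale_parity, 1)
corrupted = recovered != original_sector2
```

Sector 2 was never written, yet it comes back with wrong contents.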

> > In principle, journalling filesystems need to know the "radius of
> > corruption" to provide robust journalling.  If individual sector
> > writes are atomic, this isn't an issue.  Some people think sector
> > writes are atomic on modern hard drives (but I wouldn't count on it).
> > But it is definitely not atomic when writing to a RAID or multipath if
> > the write affects more than one device.
> 
> It would make a lot more sense to make the RAID block device driver 
> present a block device that can't corrupt data upon something as simple as 
> a restart in the middle of a write to an unrelated sector than to make 
> filesystem drivers comprehend a block device that can.  Less work, more 
> integrity.

A lot less performance?

> Some have noted recently that block devices are really too simple to do 
> some of the fancy storage things we'd like to do these days anyway, so 
> another approach would be to integrate the RAID-5 function in the 
> filesystem driver instead of attempting to have a RAID block device layer.

Like ZFS and BTRFS I guess.

This is why RAID ought to work better in the filesystem.
Two layers of journalling or equivalent does not sound good.

> For now, I'll just try to remember not to use Linux kernel RAID-5.

I've no idea if you should avoid it.  I'm making assumptions.

Other parts of Linux are a bit flaky on the issue of data integrity on
crashes though, and I/O barriers are not passed down through Linux
software RAID-5, so I'd be mighty surprised if it provides atomic writes.

> >If individual sector writes are atomic, this isn't an issue.
> 
> True, however: atomic is sufficient, but not necessary.  In the real 
> world, disk drive writes aren't atomic, and it's OK.  A journalling 
> filesystem can deal with a failed write wiping out the previous contents 
> of the subject sector.  It just can't deal with a failed write polluting 
> some unrelated previously hardened sector.

That's right.  But a failed write might corrupt previously
hardened sectors in these cases:

    - Disks with 4k sectors pretending to be 512 byte sectors.

    - RAIDs without journalling (or other equivalent) and no
      battery backup.

    - SSDs and other flash storage if their internal algorithms are stupid.
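
A rough sketch of the first case.  It assumes a drive with 4096-byte
physical sectors that emulates 512-byte logical sectors by
read-modify-write, and assumes an interrupted write can damage the
whole physical sector.  `sectors_at_risk` is a made-up helper, not an
existing API.

```python
def sectors_at_risk(lba, count, logical=512, physical=4096):
    """Logical sectors sharing a physical sector with the write
    [lba, lba + count), under the assumptions stated above."""
    per_phys = physical // logical
    first = (lba // per_phys) * per_phys
    last = ((lba + count - 1) // per_phys + 1) * per_phys
    return range(first, last)

# A single 512-byte write to logical sector 10.
at_risk = sectors_at_risk(lba=10, count=1)

# Sectors the caller never touched but which could still be damaged.
bystanders = [s for s in at_risk if s != 10]
```

Here the write to sector 10 puts logical sectors 8 to 15 at risk, so
seven previously hardened sectors are exposed.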

I've just noticed that a system crash is not the only way this type of
corruption can happen.

Does this argue for an additional parameter in the block device
hints, alongside strip size and sector size:

   Radius of Corruption on Failed Write?

For hard disks, this is the sector size.  But for RAIDs and maybe some
flash storage, it might be larger.
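
A sketch of how a filesystem might use such a hint if it existed.
`corruption_radius` is hypothetical, not an existing block-layer
attribute, and treating it as one aligned contiguous extent is a
simplification (for RAID-5 the affected area is not contiguous).

```python
from dataclasses import dataclass

@dataclass
class BlockHints:
    sector_size: int        # bytes
    corruption_radius: int  # bytes; hypothetical hint

def blast_extent(offset, length, hints):
    """Aligned byte range [start, end) a failed write might damage."""
    r = hints.corruption_radius
    start = (offset // r) * r
    end = -(-(offset + length) // r) * r  # round up to a multiple of r
    return start, end

def overlaps(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def endangered(offset, length, hardened, hints):
    """Hardened (offset, length) extents that the write does not touch
    but that lie inside its blast extent."""
    start, end = blast_extent(offset, length, hints)
    return [
        (o, n) for (o, n) in hardened
        if overlaps(o, o + n, start, end)
        and not overlaps(o, o + n, offset, offset + length)
    ]

hardened = [(4608, 512)]          # committed data next to the write
plain_disk = BlockHints(sector_size=512, corruption_radius=512)
raid_like = BlockHints(sector_size=512, corruption_radius=65536)

safe = endangered(4096, 512, hardened, plain_disk)
risky = endangered(4096, 512, hardened, raid_like)
```

With a radius equal to the sector size nothing else is endangered; with
the larger radius the neighbouring committed extent is, and a journal
would have to account for that.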

-- Jamie