Re: Triple parity and beyond

On 11/20/2013 10:16 AM, James Plank wrote:
> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
> in FAST last February presents Reed-Solomon coding with Cauchy
> matrices, and then makes special note of the common pitfall of
> assuming that you can append a Vandermonde matrix to an identity
> matrix.  Please see
> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
> slides 48-52.
> 
> Andrea, does the matrix that you included in an earlier mail (the one
> that has Linux RAID-6 in the first two rows) have a general form, or
> did you develop it in an ad hoc manner so that it would include Linux
> RAID-6 in the first two rows?

Hello Jim,

It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
today. ;)

I'm not attempting to marginalize Andrea's work here, but I can't help
but ponder what the real value of triple parity RAID is, or quad, or
beyond.  Some time ago parity RAID's primary mission ceased to be
surviving single drive failure, or a 2nd failure during rebuild, and
became mitigating UREs during a drive rebuild.  So we're now talking
about dedicating 3 drives of capacity to avoiding disaster due to
platter defects and secondary drive failure.  For small arrays that is
approaching half the array capacity, so parity RAID has largely lost
its capacity advantage over RAID10, yet it still suffers vastly
inferior performance in normal read/write IO, not to mention rebuild
times that are 3-10x longer.

WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
to mirror a drive at full streaming bandwidth, assuming 300MB/s
average--and that is probably being kind to the drive makers.  With 6 or
8 of these drives, I'd guess a typical md/RAID6 rebuild will take a
minimum of 72 hours, probably over 100, and more yet for
3P.  And with larger drive count arrays the rebuild times approach a
week.  Whose users can go a week with degraded performance?  This is
simply unreasonable, at best.  I say it's completely unacceptable.
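
To put rough numbers on that, here's a back-of-envelope sketch in
Python.  The 300MB/s streaming rate is the assumption above; the 4x
parity-rebuild multiplier is only a guess to line up with my 72 hour
figure, not a measurement:

  # Back-of-envelope rebuild time estimates.  The 300 MB/s streaming
  # rate and the 4x parity-rebuild slowdown are assumptions, not
  # measurements.
  def mirror_rebuild_hours(capacity_tb, mb_per_s=300):
      seconds = (capacity_tb * 1e12) / (mb_per_s * 1e6)
      return seconds / 3600

  print(mirror_rebuild_hours(20))      # ~18.5 h to re-mirror one 20TB drive
  print(mirror_rebuild_hours(20) * 4)  # ~74 h if a parity rebuild runs ~4x slower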

With these gargantuan drives coming soon, the probability of multiple
UREs during rebuild is pretty high.  Continuing to use ever more
complex parity RAID schemes simply increases rebuild time further.  The
longer the rebuild, the more likely a subsequent drive failure due to
heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
one failure mode we're increasing the probability of another.  TANSTAAFL.
 Worse yet, RAID10 isn't going to survive because UREs on a single drive
are increasingly likely with these larger drives, and one URE during
rebuild destroys the array.
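
The URE math is easy to sketch, too.  Assuming the spec-sheet rate of
one unrecoverable read error per 1e15 bits read (many consumer drives
are still rated at 1e14, which is far worse), in Python:

  import math

  # Probability of hitting at least one URE while reading tb_read
  # terabytes, given one URE per bits_per_ure bits read.  1e14 or 1e15
  # are typical datasheet ratings; this is an assumption, not a
  # measurement.
  def p_at_least_one_ure(tb_read, bits_per_ure=1e15):
      bits = tb_read * 1e12 * 8
      return 1 - math.exp(-bits / bits_per_ure)

  print(p_at_least_one_ure(20))    # re-mirror one 20TB drive: ~0.15
  print(p_at_least_one_ure(140))   # RAID6 rebuild reading 7x20TB: ~0.67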

I think people are going to have to come to grips with using more and
more drives simply to brace the legs holding up their arrays; comes to
grips with these insane rebuild times; or bite the bullet they so
steadfastly avoided with RAID10.  Lots more spindles solves problems,
but at a greater cost--again, no free lunch.

What I envision is an array type, something similar to RAID 51, i.e.
striped parity over mirror pairs.  In the case of Linux, this would need
to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
would need enhancement before being meshed together into this new level[1].
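
To make "striped parity over mirror pairs" concrete, here's a toy
Python model of an 8-drive layout--4 mirror pairs under a single
parity stripe.  The drive count and names are purely illustrative; the
point is which failure combinations survive:

  from itertools import combinations

  # Toy model: 4 mirror pairs with RAID5-style parity striped across
  # the pairs.  The array survives as long as at most one pair has
  # lost BOTH of its drives (that pair is then rebuilt from parity).
  pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]

  def survives(failed):
      dead_pairs = sum(1 for p in pairs if set(p) <= set(failed))
      return dead_pairs <= 1

  drives = [d for p in pairs for d in p]
  print(all(survives(f) for f in combinations(drives, 3)))   # True: any 3 failures
  print(survives(["a1", "a2", "b1", "c1", "d1"]))            # True: 5 of 8 drives
  print(survives(["a1", "a2", "b1", "b2"]))                  # False: two whole pairs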

Potential Advantages:

1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count
2.  Rebuild time is the same as RAID 10, unless a mirror pair is lost
3.  Parity is only used during rebuild if/when a URE occurs, unless a
    mirror pair is lost (per #2)
4.  Single drive failure doesn't degrade the parity array; multiple
    failures in different mirrors don't degrade the parity array
5.  Can sustain a minimum of 3 simultaneous drive failures--both drives
    in one mirror and one drive in another mirror
6.  Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
    RAID 10.  Can lose half the drives and still not degrade parity,
    if no two of them are from the same mirror
7.  Similar or possibly better read throughput vs triple parity RAID
8.  Superior write performance with drives down
9.  Vastly superior rebuild performance, as rebuilds will rarely, if
    ever, involve parity

Potential Disadvantages:

1.  +1 disk overhead vs RAID 10, many more than 2/3P w/large arrays
2.  Read-modify-write penalty vs RAID 10
3.  Slower write throughput vs triple parity RAID due to spindle deficit
4.  Development effort
5.  ??


[1]  The RAID1/5 code would need to be patched to properly handle a URE
encountered by the RAID1 code during rebuild.  There are surely other
modifications and/or optimizations that would be needed.  For large
sequential reads, more deterministic read interleaving between mirror
pairs would be a good candidate, I think.  IIUC the RAID1 driver does
read interleaving on a per thread basis or some such, which I don't
believe is going to work for this "RAID 51" scenario, at least not for
single streaming reads.  If this can be done well, we double the read
performance of RAID5, and thus we don't completely "waste" all the extra
disks vs big_parity schemes.
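
To make the interleaving idea concrete, a trivial sketch of the kind
of deterministic chunk-to-leg mapping I mean (Python; the chunk size
and the alternate-by-chunk policy are purely illustrative, not how the
md/RAID1 read balancing code actually works):

  CHUNK = 512 * 1024  # illustrative chunk size, not an md default

  # Deterministic read interleaving within one mirror pair: even chunks
  # come from leg 0, odd chunks from leg 1, so a single sequential
  # reader keeps both spindles of the pair streaming.
  def leg_for_offset(offset, legs=2):
      return (offset // CHUNK) % legs

  print([(off, leg_for_offset(off)) for off in range(0, 4 * CHUNK, CHUNK)])
  # [(0, 0), (524288, 1), (1048576, 0), (1572864, 1)]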

This proposed "RAID level 51" should have drastically lower rebuild
times vs traditional striped parity, should not suffer read/write
performance degradation with most disk failure scenarios, and with a
read interleaving optimization may have significantly greater streaming
read throughput as well.

This is far from a perfect solution and I am certainly not promoting it
as such.  But I think it does have some serious advantages over
traditional striped parity schemes, and at minimum is worth discussion
as a counterpoint of sorts.

-- 
Stan