Re: Triple-parity raid6

On 09/06/2011 14:04, NeilBrown wrote:
On Thu, 09 Jun 2011 13:32:59 +0200 David Brown <david@xxxxxxxxxxxxxxx> wrote:

On 09/06/2011 03:49, NeilBrown wrote:
On Thu, 09 Jun 2011 02:01:06 +0200 David Brown <david.brown@xxxxxxxxxxxx>
wrote:

Has anyone considered triple-parity raid6 ?  As far as I can see, it
should not be significantly harder than normal raid6 - either to
implement, or for the processor at run-time.  Once you have the GF(2^8)
field arithmetic in place for raid6, it's just a matter of making
another parity block in the same way but using a different generator:

P = D_0 + D_1 + D_2 + ... + D_(n-1)
Q = D_0 + g.D_1 + g^2.D_2 + ... + g^(n-1).D_(n-1)
R = D_0 + h.D_1 + h^2.D_2 + ... + h^(n-1).D_(n-1)

The raid6 implementation in mdraid uses g = 0x02 to generate the second
parity (based on "The mathematics of RAID-6" - I haven't checked the
source code).  You can make a third parity using h = 0x04 and then get a
redundancy of 3 disks.  (Note - I haven't yet confirmed that this is
valid for more than 100 data disks - I need to make my checker program
more efficient first.)
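
As a rough illustration of what the extra parity costs - this is only a
user-space sketch in plain C, not kernel code, and the helper names are
mine - the three syndromes can be built with the same
multiply-by-the-generator trick that the Q calculation already uses:

#include <stdint.h>
#include <stddef.h>

/* Multiply by the generator 0x02 in GF(2^8), using the usual RAID-6
 * field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* Multiply by 0x04, i.e. 0x02 squared. */
static uint8_t gf_mul4(uint8_t v)
{
	return gf_mul2(gf_mul2(v));
}

/* Compute P, Q and R over n data blocks of 'bytes' bytes each.
 * The syndromes are evaluated Horner-style from the highest disk down:
 *   Q = (..(D_(n-1).g + D_(n-2)).g + ..).g + D_0
 * and likewise for R with generator h = 0x04. */
static void make_pqr(int n, size_t bytes, uint8_t **data,
		     uint8_t *p, uint8_t *q, uint8_t *r)
{
	size_t i;
	int d;

	for (i = 0; i < bytes; i++) {
		uint8_t wp = data[n - 1][i];
		uint8_t wq = wp;
		uint8_t wr = wp;

		for (d = n - 2; d >= 0; d--) {
			wp ^= data[d][i];              /* P: plain XOR      */
			wq = gf_mul2(wq) ^ data[d][i]; /* Q: generator 0x02 */
			wr = gf_mul4(wr) ^ data[d][i]; /* R: generator 0x04 */
		}
		p[i] = wp;
		q[i] = wq;
		r[i] = wr;
	}
}

A real implementation would of course want the unrolled / SIMD treatment
that lib/raid6 already gives Q, but the inner loop only gains roughly one
more constant multiplication and XOR per data byte.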

Rebuilding a disk, or running in degraded mode, is just an obvious
extension to the current raid6 algorithms.  If you are missing three
data blocks, the maths looks hard to start with - but if you express the
equations as a set of linear equations and use standard matrix inversion
techniques, it should not be hard to implement.  You only need to do
this inversion once when you find that one or more disks have failed -
then you pre-compute the multiplication tables in the same way as is
done for raid6 today.
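
To make that concrete (same notation as above; the primed values are just
my shorthand for the syndromes with the surviving disks' contributions
cancelled out): if data disks x, y and z are lost, what is left of P, Q
and R gives

P' = D_x + D_y + D_z
Q' = g^x.D_x + g^y.D_y + g^z.D_z
R' = h^x.D_x + h^y.D_y + h^z.D_z

which is a 3x3 linear system over GF(2^8).  As long as its coefficient
matrix is non-singular for every choice of x, y and z (which is what my
checker program verifies), inverting it once gives the per-disk
multipliers needed for the rebuild.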

In normal use, calculating the R parity is no more demanding than
calculating the Q parity.  And most rebuilds or degraded situations will
only involve a single disk, and the data can thus be re-constructed
using the P parity just like raid5 or two-parity raid6.


I'm sure there are situations where triple-parity raid6 would be
appealing - it has already been implemented in ZFS, and it is only a
matter of time before two-parity raid6 has a real probability of hitting
an unrecoverable read error during a rebuild.


And of course, there is no particular reason to stop at three parity
blocks - the maths can easily be generalised.  1, 2, 4 and 8 can be used
as generators for quad-parity (checked up to 60 disks), and adding 16
gives you quintuple parity (checked up to 30 disks) - but that's maybe
getting a bit paranoid.


ref.:

<http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
<http://blogs.oracle.com/ahl/entry/acm_triple_parity_raid>
<http://queue.acm.org/detail.cfm?id=1670144>
<http://blogs.oracle.com/ahl/entry/triple_parity_raid_z>


   -ENOPATCH  :-)

I have a series of patches nearly ready which removes a lot of the remaining
duplication in raid5.c between raid5 and raid6 paths.  So there will be
relatively few places where RAID5 and RAID6 do different things - only the
places where they *must* do different things.
After that, adding a new level or layout which has 'max_degraded == 3' would
be quite easy.
The most difficult part would be the enhancements to libraid6 to generate the
new 'syndrome', and to handle the different recovery possibilities.

So if you're not otherwise busy this weekend, a patch would be nice :-)


I'm not going to promise any patches, but maybe I can help with the
maths.  You say the difficult part is the syndrome calculations and
recovery - I've got these bits figured out on paper and some
quick-and-dirty python test code.  On the other hand, I don't really
want to get into the md kernel code, or the mdadm code - I haven't done
Linux kernel development before (I mostly program 8-bit microcontrollers
- when I code on Linux, I use Python), and I fear it would take me a
long time to get up to speed.

However, if the parity generation and recovery is neatly separated into
a libraid6 library, the whole thing becomes much more tractable from my
viewpoint.  Since I am new to this, can you tell me where I should get
the current libraid6 code?  I'm sure google will find some sources for
me, but I'd like to make sure I start with whatever version /you/ have.





You can see the current kernel code at:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=lib/raid6;h=970c541a452d3b9983223d74b10866902f1a47c7;hb=HEAD


int.uc is the generic C code which 'unroll.awk' processes to produce versions
with the loops unrolled by different amounts, to suit CPUs with different
numbers of registers.
Then there are sse1, sse2 and altivec implementations, which provide the same
functionality in assembler optimised for various processors.

And 'recov' has the smarts for doing the reverse calculation when 2 data
blocks, or 1 data and P are missing.
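
For reference, the two-data-disk case in the raid6.pdf paper linked above
works out (in GF(2^8), with x < y the failed data disks and Pxy, Qxy the
syndromes computed with those two disks treated as zero) as:

Dx = A.(P + Pxy) + B.(Q + Qxy)
Dy = (P + Pxy) + Dx

where A = g^(y-x).(g^(y-x) + 1)^-1 and B = g^(-x).(g^(y-x) + 1)^-1 depend
only on the failed positions, which is why the recovery can be driven from
precomputed multiplication tables.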

Even if you don't feel up to implementing everything, a start might be
useful.  You never know when someone might jump up and offer to help.

NeilBrown


Looking at recov.c, I see that the raid6_dual_recov() function has no code for the data+Q failure case, as it is equivalent to raid5 recovery.  Should that case not still be implemented there, so that testing can be more complete?

Is there a general entry point for the recovery routines, which then decides which of raid6_2data_recov, raid6_datap_recov or raid6_dual_recov is called?  With triple-parity raid there are many more combinations - it would make sense for the library to have a single function like:

void raid7_3_recov(int disks, size_t bytes, int noOfFails,
		int *pFails, void **ptrs);

or even (to cover quad parity and more):

void raid7_n_recov(int disks, int noOfParities, size_t bytes,
	int noOfFails, int *pFails, void **ptrs);
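
Just to illustrate the sort of thing I mean - this is only a sketch, the
assumption that the three parities sit in the last three slots of ptrs[]
is mine, and the per-case solvers are not filled in - such an entry point
could classify the failures and then dispatch:

void raid7_3_recov(int disks, size_t bytes, int noOfFails,
		int *pFails, void **ptrs)
{
	int dfail[3];	/* failed *data* slots, for the solvers below */
	int ndata = 0;
	int i;

	/* Count how many of the failed slots hold data rather than parity,
	 * assuming P, Q, R occupy slots disks-3, disks-2 and disks-1. */
	for (i = 0; i < noOfFails; i++)
		if (pFails[i] < disks - 3)
			dfail[ndata++] = pFails[i];

	switch (ndata) {
	case 0:	/* only parity lost: regenerate P/Q/R from the data      */
		break;
	case 1:	/* one data block: raid5-style XOR via P (or via Q/R if
		 * P is among the failures), then refill any lost parity */
		break;
	case 2:	/* two data blocks: 2x2 system, as raid6_2data_recov     */
		break;
	case 3:	/* three data blocks: the 3x3 inversion discussed above  */
		break;
	}
}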






--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

