Re: RFC - de-clustered raid 60 or 61 algorithm

Hello all,

Any progress on this de-clustered raid?

You may already know that, aside from de-clustered raid layouts that
distribute data and parity chunks, there is also a method that
distributes spare chunks along with the data and parity, which is said
to bring a faulted array (with one or more failed disks) back to
normal production status (rebuild) very quickly.

OpenZFS, for example, uses a "Permutation Development Data Layout" for
its dRAID (https://github.com/zfsonlinux/zfs/wiki/dRAID-HOWTO), which
looks very promising.
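
As a rough illustration (a generic sketch, not the actual dRAID
layout) of why distributed spare chunks speed up a rebuild: with a
dedicated hot spare every reconstructed chunk funnels into one disk,
while distributed spare space spreads the rebuild writes over all
survivors.  The disk size and bandwidth numbers below are made up:

def rebuild_time_hours(disk_tb, disk_mb_s, n_survivors, distributed_spare):
    # Write bandwidth available to absorb the reconstructed data.
    writers = n_survivors if distributed_spare else 1
    megabytes = disk_tb * 1_000_000
    return megabytes / (writers * disk_mb_s) / 3600

# 10 TB drives at 150 MB/s sustained, 99 surviving disks:
print(rebuild_time_hours(10, 150, 99, distributed_spare=False))  # ~18.5 hours
print(rebuild_time_hours(10, 150, 99, distributed_spare=True))   # ~0.19 hours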

Best,

Feng


On Fri, Feb 9, 2018 at 10:04 PM John Stoffel <john@xxxxxxxxxxx> wrote:
>
> >>>>> "NeilBrown" == NeilBrown  <neilb@xxxxxxxx> writes:
>
> NeilBrown> On Thu, Feb 08 2018, Wol's lists wrote:
> >> After the de-clustered thread, Neil said it would probably only take a
> >> small amount of coding to do something like that. There was also some
> >> discussion about spreading the load over the disks during a
> >> reconstruction if there are a lot of disks in an array. I've been trying
> >> to get my head round a simple algorithm to smear data over the disks
> >> along the lines of raid-10.
> >>
> >> Basically, the idea is to define a logical stripe which is a multiple of
> >> both the number of physical disks, and of the number of logical disks.
> >> Within this logical stripe the blocks are shuffled using prime numbers
> >> to make sure we don't get a pathological shuffle.
> >>
> >> At present, I've defined the logical stripe to be simply the product of
> >> the number of logical disks times the number of mirrors times the number
> >> of physical disks. We could shrink this by removing common factors, but
> >> we don't need to.
> >>
> >> Given a logical block within this stripe, its physical position is
> >> calculated by the simple equation "logical block * prime mod logical
> >> stripe size". So long as the "prime" does not share any factors with the
> >> logical stripe size, then (with one exception) you're not going to get
> >> hash collisions, and you're not going to get more than one block per
> >> stripe stored on each drive. The exception, of course, is if physical
> >> disks is not greater than logical disks. Having the two identical is
> >> especially dangerous as users will not expect the pathological behaviour
> >> they will get - multiple blocks per stripe stored on the same disk.
> >> mdadm will need to detect and reject this layout. I think the best
> >> behaviour will be found where logical disks, mirrors, and physical disks
> >> don't share any prime factors.
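
A quick Python sketch of that shuffle, for anyone who wants to play
with the parameters.  Two details are my own guesses rather than
anything Wol specified: I treat each run of "logical disks"
consecutive logical blocks as one raid stripe, and I assume physical
position p within the big logical stripe lands on drive p mod the
number of physical disks:

from math import gcd

def shuffle(block, prime, stripe_size):
    # A bijection on 0..stripe_size-1 whenever gcd(prime, stripe_size) == 1.
    return (block * prime) % stripe_size

def bad_stripes(n_logical, n_mirrors, n_physical, prime):
    stripe_size = n_logical * n_mirrors * n_physical
    if gcd(prime, stripe_size) != 1:
        raise ValueError("prime shares a factor with the logical stripe size")

    pos = [shuffle(b, prime, stripe_size) for b in range(stripe_size)]
    assert len(set(pos)) == stripe_size   # no hash collisions: a true permutation

    # Count raid stripes (n_logical consecutive logical blocks) that put two
    # of their blocks on the same physical drive.
    bad = 0
    for start in range(0, stripe_size, n_logical):
        drives = {pos[b] % n_physical for b in range(start, start + n_logical)}
        if len(drives) < n_logical:
            bad += 1
    return bad

print(bad_stripes(5, 2, 7, 11))   # 5 logical, 2 mirrors, 7 physical: expect 0
print(bad_stripes(8, 2, 7, 11))   # more logical than physical: every stripe collides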
> >>
> >> I've been playing with a mirror setup, and if we have two mirrors, we
> >> can rebuild any failed disk by copying from two other drives. I think
> >> also (I haven't looked at it) that you could do a fast rebuild without
> >> impacting other users of the system too much provided you don't swamp
> >> i/o bandwidth, as half of the requests for data on the three drives
> >> being used for rebuilding could actually be satisfied from other drives.
>
> NeilBrown> I think that ends up being much the same result as a current raid10
> NeilBrown> where the number of copies doesn't divide the number of devices.
> NeilBrown> Reconstruction reads come from 2 different devices, and half the reads
> NeilBrown> that would go to them now go elsewhere.
>
> NeilBrown> I think that if you take your solution and a selection of different
> NeilBrown> "prime" number and rotate through the primes from stripe to stripe, you
> NeilBrown> can expect a more even distribution of load.
>
> NeilBrown> You hint at this below when you suggest that adding the "*prime" doesn't
> NeilBrown> add much.  I think it adds a lot more when you start rotating the primes
> NeilBrown> across the stripes.
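
Something like this, perhaps?  A hedged sketch of that rotation - the
prime table and the per-stripe selection rule below are placeholders I
made up, not anything Neil specified:

from math import gcd

PRIMES = [11, 13, 17, 19, 23]    # each must be coprime with the stripe size

def block_to_position(logical_block, stripe_size):
    stripe_no, offset = divmod(logical_block, stripe_size)
    prime = PRIMES[stripe_no % len(PRIMES)]   # rotate the prime per big stripe
    assert gcd(prime, stripe_size) == 1
    return stripe_no * stripe_size + (offset * prime) % stripe_size

stripe_size = 5 * 2 * 7    # logical disks * mirrors * physical disks = 70
for b in (0, 1, 69, 70, 71):
    print(b, "->", block_to_position(b, stripe_size))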
>
> >>
> >> Rebuilding a striped raid such as a raid-60 also looks like it would
> >> spread the load.
> >>
> >> The one thing that bothers me is that I don't know whether the "* prime
> >> mod" logic actually adds very much - whether we can just stripe stuff
> >> across like we do with raid-10. Where it will score is in a storage
> >> assembly that is a cluster of clusters of disks. Say you have a
> >> controller with ten disks, and ten controllers in a rack, a suitable
> >> choice of prime and logical stripe could ensure the rack would survive
> >> losing a controller. And given that dealing with massive arrays is what
> >> this algorithm is about, that seems worthwhile.
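
That claim is cheap to brute-force check.  Under the same assumptions
as the sketch further up (raid stripe = consecutive logical blocks,
drive = position mod physical disks), plus my assumption that
drive // 10 gives the controller, you can measure the worst number of
blocks any one controller holds from a single stripe:

def worst_per_controller(n_logical, n_mirrors, n_physical, prime, per_ctrl):
    stripe_size = n_logical * n_mirrors * n_physical
    pos = [(b * prime) % stripe_size for b in range(stripe_size)]
    worst = 0
    for start in range(0, stripe_size, n_logical):
        ctrls = [(pos[b] % n_physical) // per_ctrl
                 for b in range(start, start + n_logical)]
        worst = max(worst, max(ctrls.count(c) for c in set(ctrls)))
    return worst

# 9 logical disks, 2 mirrors, 100 drives on 10 controllers of 10 drives each.
print(worst_per_controller(9, 2, 100, 7, 10))    # expect 2: this prime is too small
print(worst_per_controller(9, 2, 100, 11, 10))   # expect 1: a controller loss costs
                                                 # at most one block per stripe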
>
> NeilBrown> The case of multiple failures is my major concern with the whole idea.
> NeilBrown> If failures were truly independent, then losing 3 in 100 at the same
> NeilBrown> time is probably quite unlikely.  But we all know there are various
> NeilBrown> factors that can cause failures to correlate - controller failure being
> NeilBrown> just one of them.
> NeilBrown> Maybe if you have dual-attached fibre channel to each drive, and you get
> NeilBrown> the gold-plated drives that have actually been exhaustively tested....
>
> NeilBrown> What do large installations do?  I assumed they had lots of modest
> NeilBrown> raid5 arrays and then mirrored between pairs of those (for data they
> NeilBrown> actually care about).  Maybe I'm wrong.
>
> Well, I can only really speak to how Netapp does it, but they have
> what is called RAID-DP (Dual Parity) but they also build up data in
> Aggregates, which are composed of one or more Raid-Groups.  Each raid
> group has 2 parity and X data drives, usually around 14-16 drives.
>
> The idea being that for large setups, you try to put each member of
> the raid group on a separate shelf (shelves generally have 24 or 48
> drives now; in older times they had... umm, 12 or 16?  I forget off
> the top of my head).
>
> Part of the idea would be that if you lost an entire shelf of disks,
> you wouldn't lose all your raidgroups/aggregates.
>
> And yes, if you lost a raidgroup, you lost the aggregate.
>
> So paying attention to and planning for disk failures in large numbers
> of disks is a big part of Netapp, even with dual ported SAS links to
> shelves, and dual paths to SAS/SATA drives in those shelves.
>
> I would be honestly scared to have 100 disks holding my data with only
> two parity disks.   Losing three disks just kills you dead, and it's
> going to happen.
>
> For example, you can go to 45drives.com and buy a system with 60 drives
> in it, using LSI controllers and (I think) expander cards.  Building a
> reliable system to handle the failure of a single controller is key
> here.  And I think you can add more controllers, so you reduce your
> number of expander ports, which improves (hopefully!) reliability.
>
> It's all a tradeoff.
>
> What I would do is run some simulations where you set up 100 disks with
> X of them as parity, and then simulate failure rates.  Then include
> the controllers in that simulation and see how likely you are to lose
> all your data before you can rebuild.  It's not an easy problem space
> at all.
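
For what it's worth, a crude Monte Carlo along those lines fits in a
few lines of Python.  The MTBF, the rebuild time, the
independent-failures assumption (exactly what Neil warned about) and
the "ignore replacement churn" simplification are all placeholders
you'd want to refine before trusting the numbers:

import random

def p_data_loss(n_disks=100, parity=2, mtbf_hours=1_000_000,
                rebuild_hours=24, years=5, trials=20_000):
    horizon = years * 365 * 24
    losses = 0
    for _ in range(trials):
        # Sample each disk's first failure time; ignore replacement churn.
        fails = sorted(random.expovariate(1 / mtbf_hours)
                       for _ in range(n_disks))
        fails = [t for t in fails if t < horizon]
        # Roughly: data is lost if parity+1 failures land in one rebuild window.
        for i in range(len(fails) - parity):
            if fails[i + parity] - fails[i] < rebuild_hours:
                losses += 1
                break
    return losses / trials

print(p_data_loss())                        # one big 100-disk, 2-parity array
print(p_data_loss(n_disks=16, parity=2))    # a Netapp-style 14+2 raid group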
>
> John
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


