Re: Roadmap for md/raid ???

But multiple rebuilds are already supported. If you have several arrays built
from partitions of the same drives and the CPU is the limiting factor, you may
want to set /sys/block/mdX/md/sync_force_parallel to 1.
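
For example, if md0 and md1 are built from partitions of the same pair of
disks (placeholder names only), letting both resyncs run at the same time
looks something like:

    echo 1 > /sys/block/md0/md/sync_force_parallel
    echo 1 > /sys/block/md1/md/sync_force_parallel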

Cheers,
Bernd

On Friday 19 December 2008 16:51:24 Justin Piszcz wrote:
> Or, before that, allow multiple arrays to rebuild on each core of the
> CPU(s), one per array.
>
> Justin.
>
> On Fri, 19 Dec 2008, Chris Worley wrote:
> > How about "parallelized parity calculation"... given SSD I/O
> > performance, parity calculations are now the performance bottleneck.
> > Most systems have plenty of CPUs to do parity calculations in
> > parallel.  Parity calculations are embarrassingly parallel (no
> > dependence between the domains in a domain decomposition).
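
For illustration only, a small userspace sketch of that idea (not md code;
the block size, thread count and all names below are invented): RAID-5 style
parity is just the XOR of the data blocks, and disjoint byte ranges can be
handled by separate threads with no locking between them:

    /* Rough sketch: XOR parity over NDATA data blocks, split across
     * threads by byte range.  Not the md implementation. */
    #include <pthread.h>
    #include <stddef.h>
    #include <string.h>

    #define NDATA    4        /* data blocks per stripe (example value) */
    #define BLKSIZE  65536    /* bytes per block (example value)        */
    #define NTHREADS 4

    static unsigned char data[NDATA][BLKSIZE];   /* stripe data        */
    static unsigned char parity[BLKSIZE];        /* parity block       */

    struct range { size_t off, len; };

    /* XOR the data blocks into the parity block over one byte range. */
    static void *xor_range(void *arg)
    {
        struct range *r = arg;
        memset(parity + r->off, 0, r->len);
        for (int d = 0; d < NDATA; d++)
            for (size_t i = 0; i < r->len; i++)
                parity[r->off + i] ^= data[d][r->off + i];
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct range r[NTHREADS];
        size_t chunk = BLKSIZE / NTHREADS;

        /* Each thread owns a disjoint byte range, so no locking is needed. */
        for (int t = 0; t < NTHREADS; t++) {
            r[t].off = t * chunk;
            r[t].len = (t == NTHREADS - 1) ? BLKSIZE - r[t].off : chunk;
            pthread_create(&tid[t], NULL, xor_range, &r[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }
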
> >
> > Chris
> >
> > On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@xxxxxxx> wrote:
> >> Not really a roadmap, more a few tourist attractions that you might
> >> see on the way if you stick around (and if I stick around)...
> >>
> >> Comments welcome.
> >>
> >> NeilBrown
> >>
> >>
> >> - Bad block list
> >>  The idea here is to maintain and store on each device a list of
> >>  blocks that are known to be 'bad'.  This effectively allows us to
> >>  fail a single block rather than a whole device when we get a media
> >>  write error.  Of course if updating the bad-block-list gives an
> >>  error we then have to fail the device.
> >>
> >>  We would also record a bad block if we get a read error on a degraded
> >>  array.  This would e.g. allow recovery for a degraded raid1 where the
> >>  sole remaining device has a bad block.
> >>
> >>  An array could have multiple errors on different devices and just
> >>  those stripes would be considered to be "degraded".  As long as no
> >>  single stripe had too many bad blocks, the data would still be safe.
> >>  Naturally as soon as you get one bad block, the array becomes
> >>  susceptible to data loss on a single device failure, so it wouldn't
> >>  be advisable to run with non-empty badblock lists for an extended
> >>  length of time.  However it might provide breathing space until
> >>  drive replacement can be achieved.
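
To make the idea concrete, a toy version of such a list (purely
illustrative; the names, sizes and layout are invented, and a real
implementation would keep the table sorted, search it more efficiently and
persist it in the per-device metadata; if updating the list fails, the
device is failed as described above):

    #include <stdio.h>

    /* Toy per-device bad block list: a capped table of
     * (start sector, length) extents.  Illustration only. */
    #define MAX_BAD 512

    struct bad_extent { unsigned long long start; unsigned len; };

    struct badblock_list {
        struct bad_extent ext[MAX_BAD];
        int count;            /* if the table fills up, or updating it
                                 fails, the whole device is failed */
    };

    /* Does the request [sector, sector+len) touch a known bad extent? */
    static int is_bad(const struct badblock_list *bbl,
                      unsigned long long sector, unsigned len)
    {
        for (int i = 0; i < bbl->count; i++) {
            unsigned long long s = bbl->ext[i].start;
            if (s < sector + len && sector < s + bbl->ext[i].len)
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        struct badblock_list bbl = { .ext = { { 1000, 8 } }, .count = 1 };
        printf("%d %d\n", is_bad(&bbl, 992, 8), is_bad(&bbl, 1004, 4));
        return 0;
    }
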
> >>
> >> - hot-device-replace
> >>  This is probably the most asked for feature of late.  It would allow
> >>  a device to be 'recovered' while the original was still in service.
> >>  So instead of failing out a device and adding a spare, you can add
> >>  the spare, build the data onto it, then fail out the device.
> >>
> >>  This meshes well with the bad block list.  When we find a bad block,
> >>  we start a hot-replace onto a spare (if one exists).  If sleeping
> >>  bad blocks are discovered during the hot-replace process, we don't
> >>  lose the data unless we find two bad blocks in the same stripe.
> >>  And then we just lose data in that stripe.
> >>
> >>  Recording in the metadata that a hot-replace was happening might be
> >>  a little tricky, so it could be that if you reboot in the middle,
> >>  you would have to restart from the beginning.  Similarly there would
> >>  be no 'intent' bitmap involved for this resync.
> >>
> >>  Each personality would have to implement much of this independently,
> >>  effectively providing a mini raid1 implementation.  It would be very
> >>  minimal without e.g. read balancing or write-behind etc.
> >>
> >>  There would be no point implementing this in raid1.  Just
> >>  raid456 and raid10.
> >>  It could conceivably make sense for raid0 and linear, but that is
> >>  very unlikely to be implemented.
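
Very roughly, the copy phase described above amounts to something like the
following toy simulation (all names and sizes invented; the "devices" are
just buffers and reconstruction is a stub): copy good chunks straight
across, rebuild bad ones from the remaining members, then fail out the old
device once the copy completes:

    #include <stdio.h>
    #include <string.h>

    #define CHUNK   4
    #define NCHUNKS 8

    static char olddev[NCHUNKS * CHUNK];
    static char newdev[NCHUNKS * CHUNK];
    static int  bad[NCHUNKS] = { 0, 0, 1, 0, 0, 0, 1, 0 };  /* bad chunks */

    /* Stand-in for rebuilding a chunk from the other members plus parity. */
    static void reconstruct(char *dst, int chunk)
    {
        (void)chunk;
        memset(dst, 'R', CHUNK);
    }

    int main(void)
    {
        memset(olddev, 'D', sizeof(olddev));

        for (int c = 0; c < NCHUNKS; c++) {
            if (!bad[c])
                memcpy(newdev + c * CHUNK, olddev + c * CHUNK, CHUNK);
            else
                reconstruct(newdev + c * CHUNK, c);
            /* In md, new writes to chunks already copied would go to both
             * the old and the new device; when the loop finishes the old
             * device can be failed out. */
        }
        printf("%.*s\n", (int)sizeof(newdev), newdev);
        return 0;
    }
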
> >>
> >> - split-mirror
> >>  This is really a function of mdadm rather than md.  It is already
> >>  quite possible to break a mirror into two separate single-device
> >>  arrays.  However it is a sufficiently common operation that it is
> >>  probably worth making it very easy to do with mdadm.
> >>  I'm thinking something like
> >>      mdadm --create /dev/md/new --split /dev/md/old
> >>
> >>  will create a new raid1 by taking one device off /dev/md/old (which
> >>  must be a raid1) and making an array with exactly the right metadata
> >>  and size.
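
For comparison, the way this is usually done by hand today looks roughly
like the following (device names are placeholders, and it glosses over
getting the metadata version, size and offsets exactly right, which is just
what --split would automate):

    mdadm /dev/md/old --fail /dev/sdc1 --remove /dev/sdc1
    mdadm --create /dev/md/new --level=1 --raid-devices=2 /dev/sdc1 missing
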
> >>
> >> - raid5->raid6 conversion.
> >>   This is also a fairly commonly asked for feature.
> >>   The first step would be to define a raid6 layout where the Q block
> >>   was not rotated around the devices but was always on the last
> >>   device.  Then we could change a raid5 to a singly-degraded raid6
> >>   without moving any data.
> >>
> >>   The next step would be to implement in-place restriping.
> >>   This involves
> >>      - freezing a section of the array (all IO to that section blocks)
> >>      - copying the data out to a safe backup
> >>      - copying it back in with the new layout
> >>      - updating the metadata to indicate that the restripe has
> >>        progressed.
> >>      - repeat.
> >>
> >>   This would probably be quite slow but it would achieve the desired
> >>   result.
> >>
> >>   Once we have in-place restriping we could change chunksize as
> >>   well.
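
To make those steps concrete, here is a toy model of that loop
(illustration only; the "array" is a plain buffer, the "new layout" is a
trivial per-byte transform, and the "metadata" is just a checkpoint
counter):

    #include <stdio.h>
    #include <string.h>

    #define ARRAY_SIZE 64
    #define SECTION    16               /* bytes restriped per step */

    static char arr[ARRAY_SIZE];
    static char backup[SECTION];        /* the "safe backup" copy          */
    static int  checkpoint;             /* the "metadata": bytes completed */

    /* Stand-in for rewriting a section in the new stripe layout. */
    static void write_new_layout(char *dst, const char *src, int len)
    {
        for (int i = 0; i < len; i++)
            dst[i] = src[i] + 1;
    }

    int main(void)
    {
        memset(arr, 'a', sizeof(arr));

        while (checkpoint < ARRAY_SIZE) {
            /* 1. freeze the section (in md: block IO to it)              */
            /* 2. copy the old data out to a safe backup                  */
            memcpy(backup, arr + checkpoint, SECTION);
            /* 3. write it back in the new layout                         */
            write_new_layout(arr + checkpoint, backup, SECTION);
            /* 4. update the metadata so a crash restarts from this point */
            checkpoint += SECTION;
        }
        printf("%.*s\n", ARRAY_SIZE, arr);
        return 0;
    }
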
> >>
> >> - raid5 reduce number of devices.
> >>   We can currently restripe a raid5 (or 6) over a larger number of
> >>   devices but not over a smaller number of devices.  That means you
> >>   cannot undo an increase that you didn't want.
> >>
> >>   It might be nice to allow this to happen at the same time as
> >>   increasing --size (if the devices are big enough) to allow the
> >>   array to be restriped without changing the available space.
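
Today only the growing direction exists, along the lines of (placeholder
names):

    mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0.backup

The reduction would presumably be the mirror image, something like the
currently hypothetical

    mdadm --grow /dev/md0 --raid-devices=3

possibly combined with a larger per-device --size so the usable space stays
the same.
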
> >>
> >> - cluster raid1
> >>   Allow a raid1 to be assembled on multiple hosts that share some
> >>   drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
> >>   It requires co-ordination to handle failure events and
> >>   resync/recovery.  Most of this would probably be done in userspace.



-- 
Bernd Schubert
Q-Leap Networks GmbH
