How about "parallelized parity calculation"? Given SSD I/O performance, parity
calculation is now the performance bottleneck, and most systems have plenty of
CPUs to do parity calculations in parallel. The work is also embarrassingly
parallel: the calculation for each stripe has no dependence on any other
stripe.

Chris
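To make that point concrete: P parity is just an XOR across the data blocks of
a stripe, and stripes do not depend on each other, so the work can be split
across threads with no coordination beyond partitioning the stripe range. The
following is a minimal userspace sketch, not md's actual code; the thread
count, block size, number of data devices and all function names are arbitrary
choices for illustration.

/* Illustrative only: parallel P-parity (XOR) over independent stripes.
 * Not md code; sizes, layout and names are arbitrary for the sketch. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NDATA    4              /* data devices per stripe (arbitrary) */
#define BLOCK    4096           /* bytes per block (arbitrary) */
#define NTHREADS 8              /* worker threads (arbitrary) */

struct work {
    unsigned char (*data)[NDATA][BLOCK];  /* data blocks, indexed by stripe */
    unsigned char (*parity)[BLOCK];       /* P block, indexed by stripe */
    size_t first, last;                   /* half-open range of stripes */
};

/* Each stripe's parity depends only on that stripe's data blocks,
 * so workers need no locking: they just cover disjoint stripe ranges. */
static void *xor_stripes(void *arg)
{
    struct work *w = arg;

    for (size_t s = w->first; s < w->last; s++) {
        memcpy(w->parity[s], w->data[s][0], BLOCK);
        for (int d = 1; d < NDATA; d++)
            for (int i = 0; i < BLOCK; i++)
                w->parity[s][i] ^= w->data[s][d][i];
    }
    return NULL;
}

/* Divide nstripes stripes evenly among NTHREADS workers. */
static void compute_parity(unsigned char (*data)[NDATA][BLOCK],
                           unsigned char (*parity)[BLOCK], size_t nstripes)
{
    pthread_t tid[NTHREADS];
    struct work w[NTHREADS];
    size_t per = (nstripes + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        w[t].data = data;
        w[t].parity = parity;
        w[t].first = t * per < nstripes ? t * per : nstripes;
        w[t].last  = w[t].first + per < nstripes ? w[t].first + per : nstripes;
        pthread_create(&tid[t], NULL, xor_stripes, &w[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

int main(void)
{
    size_t nstripes = 1024;      /* arbitrary test size */
    unsigned char (*data)[NDATA][BLOCK] = calloc(nstripes, sizeof *data);
    unsigned char (*parity)[BLOCK]      = calloc(nstripes, sizeof *parity);

    if (!data || !parity)
        return 1;
    compute_parity(data, parity, nstripes);
    free(data);
    free(parity);
    return 0;
}

Each worker owns a disjoint range of stripes and writes only its own parity
blocks, so no locking is needed; build with something like "cc -O2 -pthread".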
On Thu, Dec 18, 2008 at 9:10 PM, Neil Brown <neilb@xxxxxxx> wrote:
>
> Not really a roadmap, more a few tourist attractions that you might
> see on the way if you stick around (and if I stick around)...
>
> Comments welcome.
>
> NeilBrown
>
>
> - Bad block list
>   The idea here is to maintain, and store on each device, a list of
>   blocks that are known to be 'bad'. This effectively allows us to
>   fail a single block rather than a whole device when we get a media
>   write error. Of course, if updating the bad-block list gives an
>   error, we then have to fail the device.
>
>   We would also record a bad block if we get a read error on a
>   degraded array. This would, for example, allow recovery for a
>   degraded raid1 where the sole remaining device has a bad block.
>
>   An array could have multiple errors on different devices, and just
>   those stripes would be considered "degraded". As long as no single
>   stripe had too many bad blocks, the data would still be safe.
>   Naturally, as soon as you get one bad block the array becomes
>   susceptible to data loss on a single device failure, so it wouldn't
>   be advisable to run with non-empty bad-block lists for an extended
>   length of time. However, it might provide breathing space until
>   drive replacement can be achieved.
>
> - hot-device-replace
>   This is probably the most asked-for feature of late. It would allow
>   a device to be 'recovered' while the original was still in service.
>   So instead of failing out a device and adding a spare, you add the
>   spare, build the data onto it, then fail out the device.
>
>   This meshes well with the bad block list. When we find a bad block,
>   we start a hot-replace onto a spare (if one exists). If sleeping
>   bad blocks are discovered during the hot-replace process, we don't
>   lose data unless we find two bad blocks in the same stripe, and
>   then we only lose the data in that stripe.
>
>   Recording in the metadata that a hot-replace was happening might be
>   a little tricky, so it could be that if you reboot in the middle
>   you would have to restart from the beginning. Similarly, there
>   would be no 'intent' bitmap involved for this resync.
>
>   Each personality would have to implement much of this
>   independently, effectively providing a mini raid1 implementation.
>   It would be very minimal, without e.g. read balancing or
>   write-behind.
>
>   There would be no point implementing this in raid1; just raid456
>   and raid10. It could conceivably make sense for raid0 and linear,
>   but that is very unlikely to be implemented.
>
> - split-mirror
>   This is really a function of mdadm rather than md. It is already
>   quite possible to break a mirror into two separate single-device
>   arrays, but it is a sufficiently common operation that it is
>   probably worth making it very easy to do with mdadm. I'm thinking
>   of something like
>
>       mdadm --create /dev/md/new --split /dev/md/old
>
>   which would create a new raid1 by taking one device off /dev/md/old
>   (which must be a raid1) and making an array with exactly the right
>   metadata and size.
>
> - raid5->raid6 conversion
>   This is also a fairly commonly asked-for feature.
>   The first step would be to define a raid6 layout where the Q block
>   was not rotated around the devices but was always on the last
>   device. Then we could change a raid5 to a singly-degraded raid6
>   without moving any data.
>
>   The next step would be to implement in-place restriping. This
>   involves:
>   - freezing a section of the array (all I/O to that section blocks)
>   - copying the data out to a safe backup
>   - copying it back in with the new layout
>   - updating the metadata to indicate that the restripe has
>     progressed
>   - repeating for the next section
>
>   This would probably be quite slow, but it would achieve the desired
>   result.
>
>   Once we have in-place restriping we could change the chunk size as
>   well.
>
> - raid5: reduce number of devices
>   We can currently restripe a raid5 (or 6) over a larger number of
>   devices, but not over a smaller number. That means you cannot undo
>   an increase that you didn't want.
>
>   It might be nice to allow this to happen at the same time as
>   increasing --size (if the devices are big enough), so that the
>   array can be restriped without changing the available space.
>
> - cluster raid1
>   Allow a raid1 to be assembled on multiple hosts that share some
>   drives, so that a cluster filesystem (e.g. ocfs2) can be run over
>   it. This requires coordination to handle failure events and
>   resync/recovery. Most of this would probably be done in userspace.
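The in-place restriping cycle above (freeze a section, copy it out to a
backup, write it back in the new layout, record progress in the metadata,
repeat) maps fairly directly onto a loop. The sketch below is purely
illustrative and is not md code: the helper functions, the section size and
the sector type are placeholders I have invented to show the shape of the
loop, with stubs standing in for the real I/O and metadata work.

/* Illustrative sketch of the in-place restripe cycle described above.
 * The helpers are stubs standing in for the real work of blocking I/O,
 * copying stripes and updating metadata; none of them are md interfaces. */
#include <stdio.h>

typedef unsigned long long sector_t;   /* local stand-in, not the kernel type */

#define SECTION_SECTORS 2048ULL        /* size of each frozen section (arbitrary) */

/* Stubs: real code would block I/O, move data and write metadata here. */
static int  freeze_section(sector_t s, sector_t n)     { (void)s; (void)n; return 0; }
static void unfreeze_section(sector_t s, sector_t n)   { (void)s; (void)n; }
static int  copy_out_to_backup(sector_t s, sector_t n) { (void)s; (void)n; return 0; }
static int  copy_in_new_layout(sector_t s, sector_t n) { (void)s; (void)n; return 0; }
static int  record_progress(sector_t done)
{
    printf("restriped up to sector %llu\n", done);
    return 0;
}

/* Walk the array one section at a time:
 * freeze -> copy out to backup -> copy back in the new layout ->
 * record progress in the metadata -> repeat. */
static int restripe_in_place(sector_t array_sectors, sector_t progress)
{
    while (progress < array_sectors) {
        sector_t len = array_sectors - progress;

        if (len > SECTION_SECTORS)
            len = SECTION_SECTORS;

        /* 1. freeze the section: all I/O to it blocks until we unfreeze */
        if (freeze_section(progress, len))
            return -1;

        /* 2. copy the data out to a safe backup
         * 3. copy it back in with the new layout
         * 4. record in the metadata that the restripe has progressed */
        if (copy_out_to_backup(progress, len) ||
            copy_in_new_layout(progress, len) ||
            record_progress(progress + len)) {
            unfreeze_section(progress, len);
            return -1;
        }

        unfreeze_section(progress, len);
        progress += len;               /* 5. repeat with the next section */
    }
    return 0;
}

int main(void)
{
    /* pretend array of 10000 sectors, restriped from the start */
    return restripe_in_place(10000, 0) ? 1 : 0;
}

Recording progress after every completed section is what would let the
restripe resume near where it left off after a crash, at the cost of a
metadata write per section.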