Re: [PATCH] DM-RAID: Fix RAID10's check for sufficient redundancy

NeilBrown <neilb@xxxxxxx> · Thu, 10 Jan 2013 17:03:40 +1100

On Wed, 09 Jan 2013 20:02:32 -0600 Jonathan Brassow <jbrassow@xxxxxxxxxx>
wrote:

> 
> On Wed, 2013-01-09 at 20:00 -0600, Jonathan Brassow wrote:
> > DM RAID:  Fix RAID10's check for sufficient redundancy
> > 
> > Before attempting to activate a RAID array, it is checked for sufficient
> > redundancy.  That is, we make sure that there are not too many failed
> > devices - or devices specified for rebuild - to undermine our ability to
> > activate the array.  The current code performs this check twice - once to
> > ensure there were not too many devices specified for rebuild by the user
> > ('validate_rebuild_devices') and again after possibly experiencing a failure
> > to read the superblock ('analyse_superblocks').  Neither of these checks are
> > sufficient.  The first check is done properly but with insufficient
> > information about the possible failure state of the devices to make a good
> > determination if the array can be activated.  The second check is simply
> > done wrong in the case of RAID10 because it doesn't account for the
> > independence of the stripes (i.e. mirror sets).  The solution is to use the
> > properly written check ('validate_rebuild_devices'), but perform the check
> > after the superblocks have been read and we know which devices have failed.
> > This gives us one check instead of two and performs it in a location where
> > it can be done right.
> > 
> > Only RAID10 was affected and it was affected in the following ways:
> > - the code did not properly catch the condition where a user specified
> >   a device for rebuild that already had a failed device in the same mirror
> >   set.  (This condition would, however, be caught at a deeper level in MD.)
> > - the code triggers a false positive and denies activation when devices in
> >   independent mirror sets have failed - counting the failures as though they
> >   were all in the same set.
> > 
> > The most likely place this error was introduced (or this patch should have
> > been included) is in commit 4ec1e369 - first introduced in v3.7-rc1.
> > 
> > Signed-off-by: Jonathan Brassow <jbrassow@xxxxxxxxxx>
> 
> Neil,
> 
> This patch should apply cleanly on top of recent changes, but will not
> apply cleanly on an older kernel (like 3.7) due to version # changes and
> introduction of the new 'far' and 'offset' RAID10 algorithms.  If you
> think this fix should be pushed back into 3.7 rather than just applying
> on the latest code, I will make a patch for 3.7 - although I'm not
> certain how I'd handle the version number conflict.

Thanks Jon, I've applied that patch.

I can't say if it should be back-ported.  What is the worst-case scenario if
something goes wrong?  Do the current user-space code contain appropriate
tests itself too?

I don't see how version numbers will be a concern - they are just comments in
a 'Documentation' file aren't they?   If necessary, add extra text to explain
any possible confusion.

NeilBrown
Attachment:
signature.asc

Description: PGP signature