Re: [PATCH 0/1] Make failure message on re-add more explicit

NeilBrown <neilb@xxxxxxx> · Tue, 10 Apr 2012 09:41:30 +1000

On Thu, 05 Apr 2012 14:59:29 -0400 Doug Ledford <dledford@xxxxxxxxxx> wrote:

> On 02/22/2012 05:04 PM, NeilBrown wrote:
> > On Wed, 22 Feb 2012 17:59:59 +0100 Jes.Sorensen@xxxxxxxxxx wrote:
> > 
> >> From: Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> >>
> >> Hi,
> >>
> >> I have seen this come up on the list a couple of times, and also had
> >> bugs filed over it, since this used to 'work'. Making the printed
> >> error message a little more explicit should hopefully make it clearer
> >> why this is being rejected.
> >>
> >> Thoughts?
> 
> My apologies for coming into this late.  However, this change is causing
> support issues, so I looked up the old thread so that I could put my
> comments into context.
> 
> > While I'm always happy to make the error messages more helpful, I don't think
> > this one does :-(
> > 
> > The reason for the change was that people seemed to often use "--add" when
> > what they really wanted was "--re-add".
> 
> This is not an assumption you can make (that people meant re-add instead
> of add when they specifically used add).

I don't assume that they "do" but that they "might" - and I assume this
because observation confirms it.

> 
> > --add will try --re-add first,
> 
> Generally speaking this is fine, but there are instances where re-add
> will never work (such as a device with no bitmap) and mdadm upgrades all
> add attempts to re-add attempts without regard for this fact.

Minor nit: --re-add can work on a device with no bitmap.  If you assemble 4
of the 5 devices in a 5-device RAID5, then re-add the missing device before
any writes, it *should* re-add successfully (I haven't tested lately, but I
think it works).
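
To make that condition concrete, here is a deliberately simplified Python
sketch. It is a toy model, not mdadm or kernel code: the function name and
both parameters are hypothetical, and it assumes that a bitmap, when
present, has recorded every write the device missed.

```python
def re_add_should_work(has_bitmap: bool, writes_since_removal: int) -> bool:
    """Toy model of when --re-add can be expected to succeed.

    Hypothetical helper, not part of mdadm. Assumes that a bitmap, if
    present, covers every write the device missed while it was out.
    """
    if writes_since_removal == 0:
        # Nothing changed while the device was missing, so it is still
        # in sync and no bitmap is needed (the 4-of-5 RAID5 case above).
        return True
    # The array has moved on: only a bitmap can say which regions to resync.
    return has_bitmap


# The case above: no bitmap, device re-added before any writes.
print(re_add_should_work(has_bitmap=False, writes_since_removal=0))
```

The point is only that "no bitmap" does not by itself mean "no re-add";
what matters is whether the device missed any writes.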

> 
> > but if that doesn't succeed it would do the
> > plain add and destroy the metadata.
> 
> Yes.  So?  The user actually passed add in this instance.  If the user
> passes re-add and it fails, we should not automatically attempt to do an
> add.  If the user passes in add, and we attempt a re-add instead and it
> works, then great.  But if the user passes in add, we attempt a re-add
> and fail, then we can't turn around and not even attempt to add or else
> we have essentially just thrown add out the window.  It would no longer
> have any meaning at all.  And that is in fact the case now.  Add is dead
> all except for superblockless devices, for any devices with a superblock
> only re-add does anything useful, and it only works with devices that
> have a bitmap.

While there is some validity in your case, you are over-stating it here,
which is not a good thing.
--add has not become meaningless (or neutered as you say below).  The only
case where it doesn't work is when we attempted --re-add and it failed, and
we don't always attempt it.  In cases where we don't try --re-add, --add is
just as good as it was before.  There may well be a problem, but it doesn't
help to over-state it.
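
As a rough illustration of the behaviour described in this thread, the
decision can be sketched in Python as below. This is a toy model under
stated assumptions, not the mdadm source: the names are hypothetical, and
the conditions under which --re-add is attempted at all are left as an
input rather than modelled.

```python
from enum import Enum


class Outcome(Enum):
    RE_ADDED = "re-added"
    ADDED = "added as a fresh device"
    REFUSED = "refused"


def handle_add(re_add_attempted: bool, re_add_succeeded: bool) -> Outcome:
    """Toy model of what --add does after the change under discussion.

    Hypothetical helper, not mdadm code. Whether a re-add is attempted
    is passed in by the caller because this sketch does not model it.
    """
    if not re_add_attempted:
        # No re-add was tried, so --add behaves exactly as it used to.
        return Outcome.ADDED
    if re_add_succeeded:
        return Outcome.RE_ADDED
    # A re-add was tried and failed: refuse rather than silently
    # overwrite the metadata with a plain add.
    return Outcome.REFUSED
```

Only the last branch is new behaviour; the other two are unchanged.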

> 
> > So I introduced the requirement that if you want to destroy metadata, you
> > need to do it explicitly (and I know that won't stop people, but hopefully it
> > will slow them down).
> 
> Yes you did.  You totally neutered add in the process.
> 
> > Also, this is not at all specific to raid1 - it applies equally to
> > raid4/5/6/10.
> 
> I have a user issue already opened because of this change, and I have to
> say I agree with the user completely.  In their case, they set up
> machines that go out to remote locations.  The users at those locations
> are not highly skilled technical people.  This admin does not *ever*
> want those users to run zero-superblock, he's afraid they will zero the
> wrong one.  And he's right.  Before, with the old behavior of add, we at
> least had some sanity checks in place: did this device used to belong to
> this array or no array at all, is it just out of date, does it have a
> higher event counter than the array, etc.  When you used add, mdadm
> could at least perform a few of these sanity checks to make sure things
> are cool and alert the user if they aren't.  But with a workflow of
> zero-superblock/add, there are absolutely no double checks we can perform
> for the user, no ability to do any sanity checking at all, because we
> don't know what the user will be doing with the device next.

Did "--add" perform those sanity checks, or do anything with them?
I don't think so.  It would just do a --re-add if it could and a --add if it
couldn't, and --add would replace the metadata (except the device uuid, but
I don't think anyone cares about that).

--add doesn't do all the same checks that --create does so it really is
(was) a lot like --zero-superblock in its ability to make a mess.

> 
> Neil, this was a *huge* step backwards.  I think you let the idea that
> an add destroys metadata cause a problem here.  Add doesn't destroy
> metadata, it rewrites it.  But in the process it at least has the chance
> to sanity check things.  The zero-superblock/add workflow really *does*
> destroy metadata, and in a much more real way than the old add behavior.
>  I will definitely be reverting this change, and I suggest you do the same.

Also a step forwards, I believe.

The real problem I was trying to protect against (I think) was when someone
ends up with a failed RAID5 or RAID6 array and they try to --remove failed
devices and --add them back in.  This cannot work, so the devices get marked
as spares, which is bad.
So I probably want to restrict the failure to only happen when the array is
failed.  I have a half-memory that I did that but cannot find the code, so
maybe I only thought of doing it.
RAID1 and RAID10 cannot be failed (we never fail the 'last' device) so it is
probably safe to always allow --add on these arrays.
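
The narrower rule proposed here could look roughly like the following
Python sketch. It is illustrative only: the function, its parameters and
the level strings are hypothetical, and it is not taken from any existing
mdadm code.

```python
# Levels that, per the reasoning above, never end up as a failed array.
NEVER_FAILED_LEVELS = {"raid1", "raid10"}


def should_refuse_add(level: str, array_failed: bool,
                      re_add_attempted: bool, re_add_succeeded: bool) -> bool:
    """Toy model of the proposed rule: refuse --add only on a failed array.

    Hypothetical helper, not mdadm code.
    """
    if not re_add_attempted or re_add_succeeded:
        # Either plain --add applies as before, or the re-add worked.
        return False
    if level in NEVER_FAILED_LEVELS:
        # RAID1/RAID10: probably safe to always allow --add.
        return False
    # For other levels, refuse only when the array itself is failed,
    # where adding would just turn the device into a spare.
    return array_failed


# A failed RAID5 where re-add did not work: --add would be refused.
print(should_refuse_add("raid5", True, True, False))
```

Under this rule a degraded but still working RAID5 would accept --add
again, while a failed one would still require an explicit step.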

Could you say more about the sanity checks that you think mdadm did, or could
or should do, on --add?  Maybe I misunderstood, or maybe there are some useful
improvements we can make there.

Thanks,
NeilBrown
