Re: mdadm: failed devices become spares!

Neil Brown <neilb@xxxxxxx> · Tue, 18 May 2010 12:06:37 +1000

On Tue, 18 May 2010 11:30:16 +1000
Neil Brown <neilb@xxxxxxx> wrote:

> On Mon, 17 May 2010 20:10:36 +0200
> Pierre Vignéras <pierre@xxxxxxxxxxxxx> wrote:
> 
> > Did I miss something, or is there something really strange happening there?
> 
> Something strange...
> I cannot explain the 'SpareActive' messages.

Actually, I think I can explain that.

When a device fails it gets marked as faulty, then as soon as there is no
more pending IO it gets moved out of the array.  "mdadm -D" will show it with
a larger 'Number' and a 'RaidDevice' of '-'.
Normally these happen almost as a single operation, though a lot of pending
IO can slow it down.

"mdadm --monitor" identifies devices based on 'Number', so it would normally
see a working device disappear - which is reported as a failure - and a
'faulty/spare' device appear, which it ignores.

However, if --monitor gets to check the array between the above two events, it
will first see that the working drive is now faulty, so it reports a failure,
and then see that the faulty device isn't faulty any more and in fact isn't
even there.  The "isn't even there" bit doesn't register and it treats it as
'SpareActive'.
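
To make the sequence concrete, here is a toy model of that race in Python.
It is only a sketch of the logic described above, not mdadm's actual code:
the classify() function, the state names and the check_presence flag are
all made up for illustration, and the flag is just one hypothetical way the
missing "is it still there?" test could look.

```python
# Toy model: each poll is a dict mapping a device 'Number' to its state.
# A Number that is absent from the dict means no device occupies that slot.

def classify(old, new, check_presence=False):
    """Compare two polls and return a list of (event, number) tuples.

    Only Numbers seen in the previous poll are examined, so a device that
    newly appears (e.g. the 'faulty/spare' entry) is ignored.
    """
    events = []
    for number, old_state in old.items():
        new_state = new.get(number)  # None: the slot is now empty
        if old_state == "active" and new_state != "active":
            events.append(("Fail", number))
        elif old_state == "faulty" and new_state != "faulty":
            if check_presence and new_state is None:
                # Hypothetical guard: the device is gone, not recovered.
                continue
            # Without the guard, "no longer faulty" and "no longer there"
            # are treated alike, which yields the bogus event.
            events.append(("SpareActive", number))
    return events


# Three views of a two-device array in which device Number 1 fails.
before = {0: "active", 1: "active"}
between = {0: "active", 1: "faulty"}   # marked faulty, not yet moved out
after = {0: "active", 2: "faulty"}     # moved out, now has a larger Number

# Usual case: the monitor never observes the intermediate state.
print(classify(before, after))

# Race: the monitor polls between the two steps, then polls again.
print(classify(before, between))
print(classify(between, after))
print(classify(between, after, check_presence=True))
```

In the usual case only a failure on Number 1 is reported.  In the racy case
the first poll reports the failure and the second reports 'SpareActive' for
the same Number, unless the presence check is enabled.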

I should fix that.

So I'm quite sure now that your devices didn't really become spares until you
removed and added them, which is exactly the way to turn failed devices
into spares.

NeilBrown
