On 05/04/11 14:10, NeilBrown wrote:
>> - Reboot required to get system back.
>> - Restarted reshape with 9 drives.
>> - sdl suffered IO error and was kicked
>
> Very sad.
I'd say pretty damn unlucky actually.
>> - Array froze all IO.
>
> Same thing...
>
>> - Reboot required to get system back.
>> - Array will no longer mount with 8/10 drives.
>> - Mdadm 3.1.5 segfaults when trying to start reshape.
>
> Don't know why it would have done that... I cannot reproduce it easily.
No. I tried numerous incantations. The system version of mdadm is Debian's
3.1.4. This segfaulted, so I downloaded and compiled 3.1.5, which did the
same thing. I then composed most of this E-mail, made *really* sure my
backups were up to date, and tried 3.2.1, which to my astonishment worked.
It's been ticking along _slowly_ ever since.
>> Naively tried to run it under gdb to get a backtrace but was unable
>> to stop it forking
>
> Yes, tricky .... an "strace -o /tmp/file -f mdadm ...." might have been
> enough, but too late to worry about that now.
I wondered about using strace but for some reason got it into my head
that a gdb backtrace would be more useful. Then of course I got it
started with 3.2.1 and have not tried again.
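For next time, a rough sketch of what I'd try on a forking mdadm (untested
here, and the exact --grow arguments are elided):
   strace -f -o /tmp/mdadm.trace ./mdadm --grow ...   # follow children, as you suggested
   gdb --args ./mdadm --grow ...
   (gdb) set follow-fork-mode child                   # stay attached to the child that dies
   (gdb) run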
>> - Got array started with mdadm 3.2.1
>> - Attempted to re-add sdd/sdl (now marked as spares)
>
> Hmm... it isn't meant to do that any more. I thought I fixed it so that
> if a device looked like part of the array it wouldn't add it as a
> spare... Obviously that didn't work. I'd better look into it again.
Now the chain of events that led up to this was along these lines (rough
command recap below the list):
- Rebooted machine.
- Tried to --assemble with 3.1.4
- mdadm told me it did not really want to continue with 8/10 devices and
I should use --force if I really wanted it to try.
- I used --force
- I did a mdadm --add /dev/md0 /dev/sdd and the same for sdl
- I checked and they were listed as spares.
So this was all done with Debian's mdadm 3.1.4, *not* 3.1.5.
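In command form it was roughly this (from memory, so treat it as
approximate):
   mdadm --assemble /dev/md0               # refused with only 8/10 devices
   mdadm --assemble --force /dev/md0       # started degraded
   mdadm --add /dev/md0 /dev/sdd
   mdadm --add /dev/md0 /dev/sdl
   mdadm --detail /dev/md0                 # both listed as spares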
>
> No, you cannot give it extra redundancy.
> I would suggest:
> copy anything that you need off, just in case - if you can.
>
> Kill the mdadm that is running in the background. This will mean that
> if the machine crashes your array will be corrupted, but you are thinking
> of rebuilding it anyway, so that isn't the end of the world.
> In /sys/block/md0/md
>    cat suspend_hi > suspend_lo
>    cat component_size > sync_max
>
> That will allow the reshape to continue without any backup. It will be
> much faster (but less safe, as I said).
Well, I have nothing to lose, but I've just picked up some extra drives
so I'll make second backups and then give this a whirl.
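For my own notes, the whole recipe as I read it (assuming md0, and that the
backgrounded mdadm is the only one running):
   kill <pid of the backgrounded mdadm>      # stop the backup-file babysitting
   cd /sys/block/md0/md
   cat suspend_hi > suspend_lo               # collapse the suspended region
   cat component_size > sync_max             # let the reshape run to the end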
> If something goes wrong, you will need to scrap the array, recreate it,
> and copy data back from wherever you copied it to (or backups).
I did go into this with the niggling feeling that something bad might
happen, so I made sure all my backups were up to date before I started.
No biggie if it does die.
The very odd thing is I did a complete array check, plus SMART long
tests on all drives literally hours before I started the reshape. Goes
to show how ropey these large drives can be in big(ish) arrays.
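For completeness, the pre-reshape checks were nothing exotic, something
like:
   echo check > /sys/block/md0/md/sync_action    # full array scrub
   smartctl -t long /dev/sdX                     # repeated for each member disk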
> If anything there doesn't make sense, or doesn't seem to work -
> please ask.
>
> Thanks for the report. I'll try to get those mdadm issues addressed -
> particularly if you can get me the mdadm file which caused the segfault.
>
Well, luckily I preserved the entire build tree, then. I was planning on
running nm over the binary and having a fumbling, two-thumbed poke at it
with gdb, but seeing as you probably have a much better idea of what you
are looking for, I'll just send you the binary!
Thanks for the help, Neil. Much appreciated.