On Fri, Dec 23, 2016 at 11:46 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>>
>> Now I wonder if it would be possible to combine this approach with
>> something that simply hacked the metadata of each disk to re-establish
>> the correct disk order, making it possible to reassemble this
>> particular array without recreating anything. Are problems such as
>> mine common enough to warrant making this kind of verified
>> reassembly from assumed-clean disks easier?
>
> The way I look at this sort of question is to ask "what is the root
> cause?", and then "What is the best response to the consequences of that
> root cause?".
>
> In your case, I would look at the sequence of events that led to you
> needing to re-create your array, and ask "At which point could md or
> mdadm have done something differently?".
>
> If you, or someone, can describe precisely how to reproduce your outcome
> - so that I can reproduce it myself - then I'll happily have a look and
> see at which point something different could have happened.

As I mentioned in the first post, the root of the issue is cheap
hardware plus user error.

Basically, all the disks in this RAID are hosted on a JBOD that has a
tendency to 'disappear' at times. I've generally seen this happen when
one of the disks acts up: Linux attempting to reset it leads to a reset
of the whole JBOD, which makes all the disks disappear until the device
recovers. The JBOD is connected via USB3, but I had the same issues
when using an eSATA connection with a port multiplier, and from what
I've read it's a known limitation of SATA (as opposed to professional
setups based on SAS).

When this happens, md ends up removing all devices from the RAID. The
proper way to handle it, I've found, is to unmount the filesystem, stop
the array, and then reassemble it and remount it as soon as the JBOD is
back online. With this approach the RAID recovers in pretty good shape
(aside from the disk that was acting up, possibly). However, it's a bit
bothersome, and it may take some time to free up all filesystem usage
to allow the unmount, sometimes to the point of requiring a reboot.

So the last time this happened I tried something different, and I made
the mistake of trying a re-add of all the disks. This resulted in the
disks being marked as spares, because md could not restore the RAID
functionality after having dropped to 0 disks.

I'm not sure this could be handled differently, unless mdraid could be
made to not kick all the disks out when the whole JBOD disappears, but
rather wait for it to come back?

-- 
Giuseppe "Oblomov" Bilotta
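
P.S. In case it helps with reproducing the scenario, the recovery
sequence I described above amounts to roughly the following. This is
only a sketch: /dev/md0, /mnt/array and the member disks are placeholder
names, to be replaced with the actual array device, mount point and
disks.

    umount /mnt/array                        # free up the filesystem first
    mdadm --stop /dev/md0                    # tear down the failed array
    mdadm --assemble /dev/md0 /dev/sd[b-e]   # reassemble from the returned disks
                                             # (or: mdadm --assemble --scan)
    mount /dev/md0 /mnt/array                # put the filesystem back in service

The mistake last time was doing a plain --re-add of each member instead
of this stop/assemble cycle, which is what left them all marked as
spares.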