Re: Fwd: mdadm RAID5 to RAID6 migration thrown exceptions, access to data lost

NeilBrown <neilb@xxxxxxx> · Tue, 03 Sep 2019 10:18:47 +1000

On Mon, Sep 02 2019, Krzysztof Jakóbczyk wrote:

> Gentlemen,
>
> Just in order for me not to mix anything important I will quickly
> summarize what I'm about to do:
> I will try to release all the files that are being used on the target
> md0, by checking what is still being used with "lsof /data" and then
> will kill the processes that are still trying to use the array.

You won't be able to kill those processes, and there is half a chance
that the "lsof /data" will hang and be unkillable.

> After the files are being unlocked I will perform the outdated host shutdown.

I would
   sync &
   wait a little while
   reboot -f -n

A Linux system should always survive "reboot -f -n" with little data
loss, usually none.
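
As a rough sketch of that sequence (the function name and the DRY_RUN and
SETTLE_SECS variables are purely illustrative, not part of any tool; 30
seconds is an arbitrary stand-in for "a little while"):

```shell
# Sketch only: forced_reboot, DRY_RUN and SETTLE_SECS are hypothetical names.
# With DRY_RUN=1 (the default) the commands are printed instead of being run.
forced_reboot() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sync &"
        echo "sleep ${SETTLE_SECS:-30}"
        echo "reboot -f -n"
        return 0
    fi
    # Run sync in the background: with a hung array it may never return.
    sync &
    # Give whatever can still be flushed a little while to reach disk.
    sleep "${SETTLE_SECS:-30}"
    # -f: force an immediate reboot, -n: do not sync again on the way down.
    reboot -f -n
}
```

Running "DRY_RUN=0 forced_reboot" as root would actually reboot the machine.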

> I will boot a thumbstick on that computer with SystemRescueCD and will
> try to assemble the array with the "mdadm --assemble --scan -v --run"
> applying --force if necessary.

--force shouldn't be necessary, so if the first version doesn't work,
check with us first.
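
One possible way to make that first attempt, keeping the verbose output so
it can be posted if the assemble fails. This is only a sketch: the function
name, the MDADM override and the log path are made up for illustration.

```shell
# Sketch only: assemble_without_force, MDADM and the log path are hypothetical.
# $1 is where the verbose mdadm output is saved.
assemble_without_force() {
    if "${MDADM:-mdadm}" --assemble --scan -v --run >"${1:-/tmp/md-assemble.log}" 2>&1; then
        echo "assembled; check /proc/mdstat"
    else
        # Deliberately no retry with --force here.
        echo "assemble failed; post ${1:-/tmp/md-assemble.log} to the list before trying --force"
        return 1
    fi
}
```
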
>
> Please confirm me if my understanding is correct.

I'd like some more details: in particular, the "mdadm -E" output for one or
more component drives.  I'm curious what the data offset is.  As you didn't
need to give a "--backup=...." arg to mdadm, I suspect it is reasonably
large, which is good.
Sometimes raid reshape needs an 'mdadm' running to help the kernel, and
if that mdadm gets killed, the reshape will hang.
But with a largeish data-offset, no mdadm helper is needed.
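
To pull out just that field, something like the following should do. It is
a sketch that assumes version-1.x metadata, where "mdadm -E" labels the
field "Data Offset"; the function name and /dev/sdX1 are placeholders.

```shell
# Sketch only: data_offset is a hypothetical helper name.
# Reads "mdadm --examine" output on stdin and prints the Data Offset value.
data_offset() {
    # Fields in the examine output are separated by " : ".
    awk -F' : ' '/Data Offset/ { print $2 }'
}

# Usage, as root (replace /dev/sdX1 with a real component device):
#   mdadm -E /dev/sdX1 | data_offset
```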

The hang was reported 307 seconds after a "read error corrected"
message. And by that time it had hung for at least 120 seconds - maybe
as much as 240.  So there isn't obviously a strong connection, but maybe
there is a cause/effect there.
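
Spelling out that arithmetic (the function name is just for illustration):
if the report came 307 seconds after the message and the task had been hung
for between 120 and 240 seconds by then, the hang began roughly 67 to 187
seconds after the read error.

```shell
# Sketch only: hang_window is a hypothetical helper name.
# Args: seconds from read error to hang report, minimum and maximum time hung.
hang_window() {
    report_delay=$1; min_hang=$2; max_hang=$3
    # Earliest start uses the longest hang, latest start the shortest.
    echo "$((report_delay - max_hang)) to $((report_delay - min_hang)) seconds"
}

hang_window 307 120 240   # prints "67 to 187 seconds"
```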

Looking at code fixes since 3.16, I can see a couple of live-lock bugs
fixed, but they were fixed well before 2016-12-30, so they were probably
backported to the Debian kernel.

So I cannot easily find an explanation.

I suspect that if you just rebooted, the reshape would restart and
continue happily (unless/until another read error was found).
Rebooting to a rescue CD is likely to be safer.
The likely worst case is that it will hang again, and we'll need to look
more deeply.

In any case, I'd like to see that "mdadm --examine" output.

Thanks,
NeilBrown