On Mon, Sep 02 2019, Krzysztof Jakóbczyk wrote:

> Gentlemen,
>
> Just in order for me not to mix anything important I will quickly
> summarize what I'm about to do:
> I will try to release all the files that are being used on the target
> md0, by checking what is still being used with "lsof /data" and then
> will kill the processes that are still trying to use the array.

You won't be able to kill those processes, and there is half a chance
that the "lsof /data" will hang and be unkillable.

> After the files are being unlocked I will perform the outdated host shutdown.

I would
   sync &
   wait a little while
   reboot -f -n

A Linux system should always survive "reboot -f -n" with little data
loss, usually none.

> I will boot a thumbstick on that computer with SystemRescueCD and will
> try to assemble the array with the "mdadm --assemble --scan -v --run"
> applying --force if necessary.

--force shouldn't be necessary, so if the first version doesn't work,
check with us first.

>
> Please confirm me if my understanding is correct.

I'd like some more details: in particular, "mdadm -E" of one or more
component drives. I'm curious what the data offset is. As you didn't
need to give a "--backup=...." arg to mdadm, I suspect it is reasonably
large, which is good. Sometimes raid reshape needs an 'mdadm' running
to help the kernel, and if that mdadm gets killed, the reshape will
hang. But with a largeish data-offset, no mdadm helper is needed.

The hang was reported 307 seconds after a "read error corrected"
message. And by that time it had hung for at least 120 seconds - maybe
as much as 240. That puts the start of the hang somewhere between 67
and 187 seconds after the read error. So there isn't obviously a strong
connection, but maybe there is a cause/effect there.

Looking at code fixes since 3.16, I can see a couple of live-lock bugs
fixed, but they were fixed well before 2016-12-30, so they probably got
backported to the Debian kernel. So I cannot easily find an
explanation.

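For the data offset asked about above, a small helper along these lines
could pull the value out of the "mdadm -E" output. This is only a
sketch: the function name data_offset_sectors is made up for
illustration, the device name in the usage comment is a placeholder,
and it assumes v1.x metadata, where the output contains a line of the
form "Data Offset : N sectors".

```shell
#!/bin/sh
# Sketch only: read "mdadm --examine" output on stdin and print the
# "Data Offset" value in sectors. The function name is hypothetical.
# Assumes a line of the form "    Data Offset : 262144 sectors".
data_offset_sectors() {
    awk -F: '/^[ \t]*Data Offset/ {
        gsub(/[^0-9]/, "", $2)   # keep only the digits of the value
        print $2
        exit                     # first match is enough
    }'
}

# Hypothetical usage from the rescue system (device name is a placeholder):
#   mdadm --examine /dev/sdX1 | data_offset_sectors
```

The full "mdadm -E" output is still the more useful thing to post; the
helper only makes the one number easy to compare across component
drives.
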
I suspect that if you just rebooted, the reshape would restart and
continue happily (unless/until another read error was found).
Rebooting to a rescue CD is likely to be safer.

Likely worst case is that it will hang again, and we'll need to look
more deeply.

In any case, I'd like to see that "mdadm --examine" output.

Thanks,
NeilBrown