Re: Fwd: mdadm RAID5 to RAID6 migration thrown exceptions, access to data lost

Krzysztof Jakóbczyk <krzysiek.jakobczyk@xxxxxxxxx> · Mon, 2 Sep 2019 18:00:50 +0200

Gentlemen,

So that I don't mix up anything important, here is a quick summary of
what I am about to do:
1. Release all files still in use on the target md0: check what is
   still open with "lsof /data" and kill the processes that are still
   trying to use the array.
2. Once the files are released, shut down the outdated host.
3. Boot that computer from a thumb drive with SystemRescueCD and try to
   assemble the array with "mdadm --assemble --scan -v --run", adding
   --force if necessary.
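
A minimal dry-run sketch of that plan, for checking the order of the
commands before touching anything. It only prints each command unless
RUN=1 is set. The mount point /data and the mdadm options are taken from
the plan; the helper names run, plan_shutdown and plan_assemble are made
up for this sketch.

```shell
#!/bin/sh
# Dry-run sketch: prints each command instead of executing it unless RUN=1.
# Helper names are hypothetical; /data and the mdadm flags come from the plan.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# On the old system: list what still holds files open on the array,
# then halt the host. Stop or kill whatever lsof reports in between.
plan_shutdown() {
    run lsof /data
    run shutdown -h now
}

# In the rescue environment: try a plain assemble first and add --force
# only when called as "plan_assemble force".
plan_assemble() {
    if [ "${1:-}" = "force" ]; then
        run mdadm --assemble --scan -v --run --force
    else
        run mdadm --assemble --scan -v --run
    fi
}
```

Calling plan_shutdown on the old system and plan_assemble (or
"plan_assemble force") in the rescue environment prints the commands in
the order I intend to run them.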

Please confirm whether my understanding is correct.

Best regards,
Krzysztof Jakobczyk

On Mon, 2 Sep 2019 at 16:32, Phil Turmel <philip@xxxxxxxxxx> wrote:
>
> Good morning Krzysztof,
>
> On 9/2/19 7:30 AM, Krzysztof Jakóbczyk wrote:
> > Thank you for your input and I'll wait with further steps until confirmation!
> >
> > Best regards,
> > Krzysztof Jakobczyk
> >
> > On Mon, 2 Sep 2019 at 12:52, Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> >>
> >> On 02/09/19 11:05, Krzysztof Jakóbczyk wrote:
> >>> My questions are the following:
> >>>
> >>> What to do in order to move the reshape process forward?
> >>
> >> I'll leave that to others, but my gut reaction is just to restart it
> >> (don't follow my advice! Wait for someone else to say it's safe :-)
>
> Don't do anything more in your current kernel and mdadm version.
>
> >>> Do you think the data on the md0 is safe?
> >>
> >> Yes I do.
>
> I agree.
>
> >>> How to access the data on md0 if I cannot cd to it?
> >>>
> >> Wait for the system to (be) recover(ed).
> >>
> >>> What are those stack traces in the dmesg output?
>
> Those are from an unrelated process (postgres) that is stuck.  It might
> be stuck as a side effect of not being able to reach data on your array.
>
> >>> Help will be greatly appreciated.
> >>>
> >> MAKE SURE you've got a rescue disk with the latest mdadm and an
> >> up-to-date kernel. I strongly suspect you've got an out-of-date system -
> >> mdadm 3.2.2 is pretty ancient. This sounds to me like a well-known
> >> problem from back then, and if I'm right the fix is as simple as booting
> >> into an up-to-date recovery system, letting the reshape complete, and
> >> then booting back into the old system.
> >>
> >> Can someone else confirm, please!!!
>
> Yes, this is what I would do.  Do as clean a shutdown as you can on your
> system as-is.  Reboot into a rescue environment that has a current
> mdadm.  (I am a fan of SystemRescueCD, on a thumb drive, but others
> should work fine too.)
>
> Note that device names may change from kernel to kernel--you will want
> to use smartctl to verify which drive serial number is on which device
> name and adjust your command lines accordingly.
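
One way to do that check, as a rough sketch: pull the serial number out
of "smartctl -i" for each disk and print it next to the device name. The
function names serial_of and list_serials are made up here, and the
device names passed in are whatever the rescue kernel shows, so treat
them as assumptions.

```shell
#!/bin/sh
# serial_of reads "smartctl -i" output on stdin and prints the serial number.
# ATA drives report "Serial Number:", SCSI/SAS drives "Serial number:".
serial_of() {
    sed -n 's/^Serial [Nn]umber:[[:space:]]*//p'
}

# list_serials prints "device serial" for every device given as an argument.
# Needs root and smartmontools installed; hypothetical helper for this sketch.
list_serials() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(smartctl -i "$dev" | serial_of)"
    done
}
```

For example, "list_serials /dev/sd[a-z]" run as root in the rescue
environment gives one line per disk, which can be compared against the
serials noted on the old system.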
>
> You will likely have to use --assemble --force, specifying all relevant
> devices, as I doubt the current kernel will shut down cleanly, and
> therefore some superblock data will prevent auto-start.  If you used a
> backup file in your reshape command, you will need to supply it to your
> --assemble command.  (Backup files are not generally needed, and reduce
> reshape performance.)
>
> If reshape does not resume, supply the output of "mdadm -E" for all of
> your member partitions, and "smartctl -iA -l scterc" for the devices.
> When you paste the above into your email client, turn off word wrapping
> so the long lines won't be mangled.
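
A small sketch for gathering that output in one go, should it be needed.
It prints the two commands per member partition so they can be reviewed
first and then piped to sh. "-l scterc" is smartctl's option for the SCT
Error Recovery Control setting. The function name report_cmds is made up,
and deriving the disk name by cutting at the first digit only suits
sdX-style names, which is an assumption.

```shell
#!/bin/sh
# report_cmds prints, for each member partition given, the mdadm and smartctl
# commands whose output was requested. Hypothetical helper; the disk name is
# the partition name with everything from the first digit on removed, which
# works for /dev/sdb1 but not for nvme-style names.
report_cmds() {
    for part in "$@"; do
        disk=${part%%[0-9]*}
        printf 'mdadm -E %s\n' "$part"
        printf 'smartctl -iA -l scterc %s\n' "$disk"
    done
}
```

Something like "report_cmds /dev/sdb1 /dev/sdc1 | sh > report.txt 2>&1",
run as root, would collect everything into a file that can be attached
or pasted with word wrapping turned off.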
>
> >> Cheers,
> >> Wol
>
> Regards,
>
> Phil