Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Thu, 4 May 2023 19:41:21 +0800

Hi,

On 2023/04/24 3:09, Jove wrote:
Hi,

I've added two drives to my raid5 array and tried to migrate
it to raid6 with the following command:

mdadm --grow /dev/md0 --raid-devices 4 --level 6 \
    --backup-file=/root/mdadm_raid6_backup.md

This may have been my first mistake, as there are only 5
drives. It should have been --raid-devices 3, I think.

As soon as I started this grow, the filesystems became
unavailable. All processes trying to access files on them hung.
I searched the web, which said a reboot during a rebuild
was not problematic if things shut down cleanly, so I
rebooted. The reboot hung too. The drive activity
continued, so I let it run overnight. I woke up to a
rebooted system in emergency mode, as it could not
mount all the partitions on the raid array.

The OS tried to reassemble the array and succeeded.
However, the udev processes that try to create the dev
entries hang.

I went back to Google and found out how I could reboot
my system without this automatic assemble.
I tried reassembling the array with:

mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0

This failed with:
No backup metadata on mdadm_raid6_backup.md0
Failed to find final backup of critical section.
Failed to restore critical section for reshape, sorry.

I tried again with:

mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 \
    --invalid-backup /dev/md0

This said, in addition to the lines above:

continuing without restoring backup

This seemed to have succeeded in reassembling the
array but it also hangs indefinitely.

/proc/mdstat now shows:

md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
       7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
       bitmap: 1/30 pages [4KB], 65536KB chunk

A read-only array can't continue the reshape. See md_check_recovery()
for details: reshape can only start if md_is_rdwr(mddev) passes.
Do you know why this array is read-only?
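
For reference, the array mode and the current sync action can be
inspected through sysfs without writing anything. The following is a
minimal read-only sketch and not part of the original message: the
helper name md_report and the SYSFS_ROOT override are made up for
illustration, and it assumes the standard array_state and sync_action
attributes under /sys/block/<dev>/md/.

```shell
#!/bin/sh
# md_report DEV: print array_state and sync_action for an md device.
# Hypothetical helper; it only reads sysfs and never writes to it.
# SYSFS_ROOT is an illustrative override (defaults to /sys).
md_report() {
    md_dir="${SYSFS_ROOT:-/sys}/block/$1/md"
    for attr in array_state sync_action; do
        if [ -r "$md_dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$md_dir/$attr")"
        else
            printf '%s: unavailable\n' "$attr"
        fi
    done
}
```

Running "md_report md0" on the affected machine would show the state
to check: an array_state value such as readonly or read-auto means the
array is not read-write.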

Again, the udev processes trying to access this device hung indefinitely.

Eventually, the kernel dumps this in my journal:

Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0 pid: 8121 ppid:   706 flags:0x00000006
Apr 23 19:17:22 atom kernel: Call Trace:
Apr 23 19:17:22 atom kernel:  <TASK>
Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70

It looks like this normal I/O is waiting for the reshape to finish,
which is why it hung indefinitely.
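
The trace above shows systemd-udevd in state D (uninterruptible
sleep). As a rough sketch for listing every task stuck that way, the
filter below keeps only D-state rows from ps output. The helper name
filter_d_state is hypothetical, and it assumes the STAT field is the
second column of the ps output it is fed.

```shell
#!/bin/sh
# filter_d_state: read ps output on stdin, keep the header line and any
# row whose second column (STAT) starts with "D" (uninterruptible sleep).
# Hypothetical helper name, for illustration only.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

One way to feed it, assuming a procps-style ps, is
"ps -eo pid,stat,wchan:32,comm | filter_d_state".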

This really is a kernel bug. Perhaps it can be bypassed if the reshape
can complete, hopefully automatically once the array is read/write. Note:
never echo reshape to sync_action; that will corrupt data in your case.

Thanks,
Kuai