Re: Raid5 to raid6 grow interrupted, mdadm hangs on assemble command

Hi Kuai,

The mdadm --assemble command also hangs in the kernel; it never completes.

root         142     112  1 19:01 tty1     00:00:00 mdadm --assemble
/dev/md0 /dev/ubdb /dev/ubdc /dev/ubdd /dev/ubde --backup-file
mdadm_raid6_backup.md0 --invalid-backup
root         145       2  0 19:01 ?        00:00:00 [md0_raid6]

[root@LXCNAME ~]# cat /proc/142/stack
[<0>] __switch_to+0x50/0x7f
[<0>] __schedule+0x39c/0x3dd
[<0>] schedule+0x78/0xb9
[<0>] mddev_suspend+0x10b/0x1e8
[<0>] suspend_lo_store+0x72/0xbb
[<0>] md_attr_store+0x6c/0x8d
[<0>] sysfs_kf_write+0x34/0x37
[<0>] kernfs_fop_write_iter+0x167/0x1d0
[<0>] new_sync_write+0x68/0xd8
[<0>] vfs_write+0xe7/0x12b
[<0>] ksys_write+0x6d/0xa6
[<0>] sys_write+0x10/0x12
[<0>] handle_syscall+0x81/0xb1
[<0>] userspace+0x3db/0x598
[<0>] fork_handler+0x94/0x96

[root@LXCNAME ~]# cat /proc/145/stack
[<0>] __switch_to+0x50/0x7f
[<0>] __schedule+0x39c/0x3dd
[<0>] schedule+0x78/0xb9
[<0>] schedule_timeout+0xd2/0xfb
[<0>] md_thread+0x12c/0x18a
[<0>] kthread+0x11d/0x122
[<0>] new_thread_handler+0x81/0xb2
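
The stacks above were read per-PID; when several tasks wedge at once, listing every task in uninterruptible sleep (state D) is a quick way to find them all. A generic sketch, not a command from the thread:

```shell
# List all tasks currently in uninterruptible sleep (state "D") -- these
# are the tasks stuck inside the kernel, like mdadm and md0_raid6 above.
# Their kernel stacks can then be read from /proc/<pid>/stack as root.
ps -eo pid,stat,cmd --no-headers | awk '$2 ~ /^D/ { print }'
```

(`echo w > /proc/sysrq-trigger` dumps the same set of blocked tasks, with stacks, into the kernel log.)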

I have had one case in which mdadm didn't hang and the reshape
continued. Sadly, I was using sparse overlay files and the
filesystem could not hold the full 4x 4TB, so I had to terminate the
reshape.
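
As an aside, the `[n/m]` field in the /proc/mdstat line quoted in this thread can be unpacked with a small awk sketch (the sample line is copied from the thread; the script itself is only an illustration, not one of the commands used):

```shell
# Decode the [configured/active] field of an mdstat personality line.
# [4/3] means 4 devices configured but only 3 active -- one missing,
# matching the [UUU_] status map.
line='7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]'

echo "$line" | awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^\[[0-9]+\/[0-9]+\]$/) {          # the [n/m] field
            split(substr($i, 2, length($i) - 2), n, "/")
            printf "configured=%d active=%d failed_or_missing=%d\n",
                   n[1], n[2], n[1] - n[2]
        }
    }
}'
```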

Best regards,

    Johan

On Thu, May 4, 2023 at 1:41 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> 在 2023/04/24 3:09, Jove 写道:
> > Hi,
> >
> > I've added two drives to my raid5 array and tried to migrate
> > it to raid6 with the following command:
> >
> > mdadm --grow /dev/md0 --raid-devices 4 --level 6
> > --backup-file=/root/mdadm_raid6_backup.md
> >
> > This may have been my first mistake, as there are only 5
> > drives. It should have been --raid-devices 3, I think.
> >
> > As soon as I started this grow, the filesystems went
> > unavailable. All processes trying to access files on it hung.
> > I searched the web, which said a reboot during a rebuild
> > is not problematic if things shut down cleanly, so I
> > rebooted. The reboot hung too. The drive activity
> > continued so I let it run overnight. I did wake up to a
> > rebooted system in emergency mode as it could not
> > mount all the partitions on the raid array.
> >
> > The OS tried to reassemble the array and succeeded.
> > However, the udev processes that try to create the dev
> > entries hang.
> >
> > I went back to Google and found out how I could reboot
> > my system without this automatic assemble.
> > I tried reassembling the array with:
> >
> > mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0 /dev/md0
> >
> > This failed with:
> > No backup metadata on mdadm_raid6_backup.md0
> > Failed to find final backup of critical section.
> > Failed to restore critical section for reshape, sorry.
> >
> > I tried again with:
> >
> > mdadm --verbose --assemble --backup-file mdadm_raid6_backup.md0
> > --invalid-backup /dev/md0
> >
> > This said, in addition to the lines above:
> >
> > continuing without restoring backup
> >
> > This seemed to succeed in reassembling the
> > array, but it also hangs indefinitely.
> >
> > /proc/mdstat now shows:
> >
> > md0 : active (read-only) raid6 sdc1[0] sde[4](S) sdf[5] sdd1[3] sdg1[1]
> >        7813771264 blocks super 1.2 level 6, 512k chunk, algorithm 18 [4/3] [UUU_]
> >        bitmap: 1/30 pages [4KB], 65536KB chunk
>
> A read-only array can't continue the reshape; see the details in
> md_check_recovery(): reshape can only start if md_is_rdwr(mddev) passes.
> Do you know why this array is read-only?
>
> >
> > Again, the udev processes trying to access this device hung indefinitely.
> >
> > Eventually, the kernel dumps this in my journal:
> >
> > Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0
> > pid: 8121 ppid:   706 flags:0x00000006
> > Apr 23 19:17:22 atom kernel: Call Trace:
> > Apr 23 19:17:22 atom kernel:  <TASK>
> > Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
> > Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
> > Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
> > Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
> > Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70
>
> Looks like this normal IO is waiting for the reshape to finish; that's why
> it hung indefinitely.
>
> This really is a kernel bug; perhaps it can be bypassed if the reshape can
> complete, hopefully automatically once the array is made read/write. Note:
> never echo reshape to sync_action, as this will corrupt data in your case.
>
> Thanks,
> Kuai
>



