Re: linux mdadm assembly error: md: cannot handle concurrent replacement and reshape. (reboot while reshaping)

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Thu, 4 May 2023 16:16:50 +0800

Hi,

On 2023/04/28 5:09, Peter Neuwirth wrote:
Hello linux-raid group.

I have an issue with my linux raid setup and I hope somebody here
could help me get my raid active again without data loss.

I have a Debian 11 system with one raid array (6x 1TB hdd drives,
raid level 5) that was active and running until today, when I added
two more 1TB hdd drives and also changed the raid level to 6.

Note, for completeness:

My raid setup a month ago was

mdadm --create --verbose /dev/md0 -c 256K --level=5 --raid-devices=6 /dev/sdd /dev/sdc /dev/sdb /dev/sda /dev/sdg /dev/sdf

mkfs.xfs -d su=254k,sw=6 -l version=2,su=256k -s size=4k /dev/md0

mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf

update-initramfs -u

echo '/dev/md0 /mnt/data ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab

Today I did:

mdadm --add /dev/md0 /dev/sdg /dev/sdh

sudo mdadm --grow /dev/md0 --level=6

This started a grow process, which I could observe with
watch -n 1 cat /proc/mdstat
and md0 was still usable all day.
Because I needed faster file access, I paused the grow and insertion
process today at about 50% by issuing

echo "frozen" > /sys/block/md0/md/sync_action

After the file access was done, I restarted the
process with

echo reshape > /sys/block/md0/md/sync_action

After looking into this, I figured out that this is how the problem
(corrupted data) was triggered in the first place; the kernel log
message "md: cannot handle concurrent replacement and reshape" is not
fatal.

"echo reshape" restarts the whole process from the beginning, whereas
the recorded reshape position should be used. This is a serious kernel
bug; I'll try to fix it soon.

By the way, "echo idle" should avoid this problem.
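
A minimal sketch of the pause/resume sequence described above, assuming
the array is md0 and its sysfs directory is /sys/block/md0/md. The
MD_DIR variable and the helper names are illustrative only and do not
come from mdadm or the kernel; the sync_action and reshape_position
files are the md sysfs attributes. Treat it as an illustration, not a
tested recovery procedure.

```shell
#!/bin/sh
# Sketch only: MD_DIR and the function names below are hypothetical helpers.
MD_DIR="${MD_DIR:-/sys/block/md0/md}"

# Pause a running reshape by freezing sync activity on the array.
pause_reshape() {
    echo frozen > "$MD_DIR/sync_action"
}

# Resume by writing "idle" rather than "reshape", so that md is left to
# continue from the recorded position instead of starting over.
resume_reshape() {
    echo idle > "$MD_DIR/sync_action"
}

# Print the recorded reshape position (a sector number, or "none").
show_reshape_position() {
    cat "$MD_DIR/reshape_position"
}
```

Comparing the output of show_reshape_position before pausing and after
resuming is one way to check that the reshape continued instead of
starting over.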

Thanks,
Kuai

but I saw in mdstat that it started from scratch.
After about 5 min I noticed that the /dev/md0 mount was gone with
an input/output error in syslog, and I rebooted the computer to see
whether the kernel would reassemble md0 correctly. Maybe this was a
problem, because md0 was still reshaping; I do not know.