Re: linux mdadm assembly error: md: cannot handle concurrent replacement and reshape. (reboot while reshaping)

Peter Neuwirth <reddunur@xxxxxxxxx> · Thu, 4 May 2023 10:36:55 +0200

Thank you, Kuai!
So my gut instinct was not that bad. Now that I could reassemble my RAID set (it tried to resume the rebuild; I stopped it),
I have a /dev/md0, but it seems that no sensible data is stored on it. Not even a partition table could be found.

From your investigation, what would you say: is there hope that I could rescue some of the data from the RAID set with a tool
like testdisk if I "recreate" my old GPT partition table? Or is it likely that the restarted reshape/grow process made
minced meat of my whole RAID data?
It seemed interesting to me that the first grow/reshape process did not even seem to touch the two added discs, which are shown as
spares now; their partition tables had not been touched. The process seems to have dealt only with my legacy RAID 5 set of
six disks and to have moved it to a transitional RAID 5/6 layout, therefore operating at least on disc 3 of the legacy
set, which is now missing.
I'm not sure how much time it is sensible to spend on this data;
your advice could be very helpful.

regards

Peter

On 04.05.23 at 10:16, Yu Kuai wrote:
Hi,

On 2023/04/28 5:09, Peter Neuwirth wrote:
Hello linux-raid group.

I have an issue with my Linux RAID setup, and I hope somebody here
can help me get my RAID active again without data loss.

I have a Debian 11 system with one RAID array (6x 1TB HDDs, RAID level 5)
that was running fine until today, when I added two more 1TB HDDs
and also changed the RAID level to 6.

Note, for completeness:

My RAID setup a month ago was:

mdadm --create --verbose /dev/md0 -c 256K --level=5 --raid-devices=6  /dev/sdd /dev/sdc /dev/sdb /dev/sda /dev/sdg /dev/sdf

mkfs.xfs -d su=254k,sw=6 -l version=2,su=256k -s size=4k /dev/md0

mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf

update-initramfs -u

echo '/dev/md0 /mnt/data ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab

Today I did:

mdadm --add /dev/md0 /dev/sdg /dev/sdh

sudo mdadm --grow /dev/md0 --level=6

This started a grow process, which I could observe with
watch -n 1 cat /proc/mdstat
and md0 was still usable all day.
Because I needed fast file access, I paused the grow and insertion
process today at about 50% by issuing

echo "frozen" > /sys/block/md0/md/sync_action

After the file access was done, I restarted the
process with

echo reshape > /sys/block/md0/md/sync_action

After looking into this problem, I figured out that this is how the problem
(corrupted data) was triggered in the first place, while the kernel log
message "md: cannot handle concurrent replacement and reshape"
is not fatal.

"echo reshape" will restart the whole process, whereas the recorded reshape
position should be used. This is a serious kernel bug; I'll try to fix
it soon.

By the way, "echo idle" should avoid this problem.

Thanks,
Kuai

but I saw in mdstat that it started from scratch.
After about 5 minutes I noticed that the /dev/md0 mount was gone, with
an input/output error in syslog, and I rebooted the computer to see whether the
kernel would reassemble md0 correctly. Maybe this was a problem,
because md0 was still reshaping; I do not know.
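Kuai's point is that a frozen reshape should be resumed by writing "idle" to sync_action rather than "reshape", so that md continues from its recorded reshape position instead of starting over. A minimal shell sketch of that freeze / check / resume sequence follows, assuming the array's md sysfs directory is /sys/block/md0/md as in this thread. The function names (md_pause, md_reshape_position, md_resume) and the MD_SYSFS variable are hypothetical helpers for illustration, not part of mdadm or the kernel; only the sysfs files sync_action and reshape_position belong to the kernel's md interface. It has not been tested against the bug described here, so treat it as a sketch rather than a fix.

```shell
#!/bin/sh
# Sketch only: pause and resume an md resync/reshape through sysfs.
# MD_SYSFS (hypothetical helper variable) is the array's md directory;
# it can be overridden, e.g. to exercise the functions on a scratch dir.
MD_SYSFS=${MD_SYSFS:-/sys/block/md0/md}

# Stop the running sync/reshape and keep md from restarting it.
md_pause() {
    echo frozen > "$MD_SYSFS/sync_action"
}

# Print the recorded reshape checkpoint: a sector number while a
# reshape is pending, "none" otherwise.
md_reshape_position() {
    cat "$MD_SYSFS/reshape_position"
}

# Unfreeze with "idle" (not "reshape"); per Kuai's reply in this thread,
# this lets md pick the reshape up again from its recorded position.
md_resume() {
    echo idle > "$MD_SYSFS/sync_action"
}
```

Run as root: call md_pause before the heavy file access, optionally note the value printed by md_reshape_position, then call md_resume and confirm in /proc/mdstat that the reshape continues near the previous percentage rather than from 0%.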