Re: linux mdadm assembly error: md: cannot handle concurrent replacement and reshape. (reboot while reshaping)

Hello Kuai,

Thank you for testing and reproducing this.
I have tried your recreate scenario, but mdadm refuses, apparently because
some devices are busy, or maybe because of the partition tables it found
on the HDDs..

I started it in the same manner as the initial creation.

srv11:~# mdadm  --create -f --verbose /dev/md0 -c 256K --level=5 --raid-devices=6  /dev/sde /dev/sdc /dev/sdb /dev/sda /dev/sdi /dev/sdj --assume-clean

mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sde appears to be part of a raid array:
      level=raid6 devices=7 ctime=Mon Mar  6 18:17:30 2023
mdadm: partition table exists on /dev/sde but will be lost or
      meaningless after creating array
mdadm: /dev/sdc appears to be part of a raid array:
      level=raid6 devices=7 ctime=Mon Mar  6 18:17:30 2023
mdadm: partition table exists on /dev/sdc but will be lost or
      meaningless after creating array
mdadm: /dev/sdb appears to be part of a raid array:
      level=raid6 devices=7 ctime=Mon Mar  6 18:17:30 2023
mdadm: partition table exists on /dev/sdb but will be lost or
      meaningless after creating array
mdadm: super1.x cannot open /dev/sda: Device or resource busy
mdadm: /dev/sda is not suitable for this array.
mdadm: /dev/sdi appears to be part of a raid array:
      level=raid6 devices=7 ctime=Mon Mar  6 18:17:30 2023
mdadm: partition table exists on /dev/sdi but will be lost or
      meaningless after creating array
mdadm: /dev/sdj appears to be part of a raid array:
      level=raid6 devices=7 ctime=Mon Mar  6 18:17:30 2023
mdadm: partition table exists on /dev/sdj but will be lost or
      meaningless after creating array
mdadm: create aborted


Did you disable the device mapper, or do you have any idea how I can
stop my drives from being reported as busy?
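
For reference, this is roughly what I would check next (just a sketch; I am
assuming a stale md array or a device-mapper target is still claiming
/dev/sda, and the device names are the ones from my setup above):

cat /proc/mdstat                         # is a leftover md array still claiming the disks?
lsblk -o NAME,TYPE,MOUNTPOINT /dev/sda   # is anything stacked on top of the disk?
dmsetup ls                               # device-mapper targets that may hold it open
mdadm --stop /dev/md0                    # stop a stale array before recreating, if one shows up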

I am not sure whether this is related, but I remember that after the
first reboot the raid was still reported as raid5, and I naively reissued
the commands that had started the whole process before the reboot:

> mdadm --add /dev/md0 /dev/sdg /dev/sdh
> sudo mdadm --grow /dev/md0 --level=6

and the grow simply aborted with a notice that a grow was already in
progress, but from then on I think the raid set was seen as raid6 on my
system.
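
For completeness, this is a rough way to check which level and reshape
position the array currently reports (a sketch; the exact field names may
differ between mdadm and metadata versions):

mdadm --detail /dev/md0                       # "Raid Level" and, during a reshape, the reshape status
cat /proc/mdstat                              # shows "reshape = ..." with the current position
mdadm --examine /dev/sdb | grep -i reshape    # reshape position recorded in a member superblock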

regards

Peter


On 04.05.23 at 11:08, Yu Kuai wrote:
Hi,

On 2023/05/04 16:36, Peter Neuwirth wrote:
Thank you, Kuai!
So my gut instinct was not that bad. Now that I could reassemble my raid set (it tried to continue the rebuild; I stopped it),
I have a /dev/md0, but it seems that no sensible data is stored on it. Not even a partition table could be found.

From your investigation, what would you say: is there hope that I could rescue some of the data from the raid set with a tool
like testdisk, once I "recreate" my old GPT partition table? Or is it likely that the restarted reshape/grow process made
minced meat out of my whole raid data?
It seemed interesting to me that the first grow/reshape process did not even seem to touch the two added discs, which are
shown as spares now; their partition tables had not been changed. The process seems to deal only with my legacy raid5 set of
six discs and seemed to move it to a transient raid5/6 layout, therefore operating at least on the disc (3) of the legacy
set that is now missing..
I'm not sure how much time it is sensible to spend on this data;
your advice would be very helpful.

During my test, I was able to recreate md0 and mount it, but this is for
reference only...

Test procedure:
mdadm --create --run --verbose /dev/md0 -c 256K --level=5 --raid-devices=6  /dev/sd[abcdef] --size=100M
mdadm -W /dev/md0
mkfs.xfs -f /dev/md0
echo 1024 > /sys/block/md0/md/sync_speed_max

mdadm --add /dev/md0 /dev/sdg /dev/sdh
sudo mdadm --grow /dev/md0 --level=6
sleep 2

echo frozen > /sys/block/md0/md/sync_action

echo system > /sys/block/md0/md/sync_speed_max
echo reshape > /sys/block/md0/md/sync_action
mdadm -W /dev/md0

xfs_repair -n /dev/md0

The above test reproduces the corruption of md0, and this is just
because the layout has changed. If I recreate md0 with the original disks
and --assume-clean, xfs_repair won't complain and mount will succeed:

[root@fedora ~]# mdadm --create --run --verbose /dev/md0 -c 256K --level=5 --raid-devices=6  /dev/sd[abcdef] --size=100M --assume-clean
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to contain an ext2fs file system
       size=10485760K  mtime=Mon Apr  3 06:18:17 2023
mdadm: /dev/sda appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: /dev/sdc appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: /dev/sdd appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: /dev/sde appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: /dev/sdf appears to be part of a raid array:
       level=raid5 devices=6 ctime=Thu May  4 09:00:08 2023
mdadm: largest drive (/dev/sda) exceeds size (102400K) by more than 1%
mdadm: creation continuing despite oddities due to --run
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@fedora ~]# xfs_repair -n /dev/md0
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 09:05:33: zeroing log - 4608 of 4608 blocks done
        - scan filesystem freespace and inode maps...
        - 09:05:33: scanning filesystem freespace - 8 of 8 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 09:05:33: scanning agi unlinked lists - 8 of 8 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 7
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - 09:05:33: process known inodes and inode discovery - 64 of 64 inodes done
        - process newly discovered inodes...
        - 09:05:33: process newly discovered inodes - 8 of 8 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 09:05:33: setting up duplicate extent list - 8 of 8 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 6
        - agno = 4
        - agno = 3
        - agno = 7
        - agno = 2
        - agno = 5
        - 09:05:33: check for inodes claiming duplicate blocks - 64 of 64 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 09:05:33: verify and correct link counts - 8 of 8 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

Thanks,
Kuai

regards

Peter


On 04.05.23 at 10:16, Yu Kuai wrote:
Hi,

On 2023/04/28 5:09, Peter Neuwirth wrote:
Hello linux-raid group.

I have an issue with my linux raid setup and I hope somebody here
could help me get my raid active again without data loss.

I have a debian 11 system with one raid array (6x 1TB hdd drives, raid level 5)
that had been running fine until today, when I added two more 1TB hdd drives
and also changed the raid level to 6.

Note: For completeness:

My raid setup from months ago was:

mdadm --create --verbose /dev/md0 -c 256K --level=5 --raid-devices=6  /dev/sdd /dev/sdc /dev/sdb /dev/sda /dev/sdg /dev/sdf

mkfs.xfs -d su=254k,sw=6 -l version=2,su=256k -s size=4k /dev/md0

mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf

update-initramfs -u

echo '/dev/md0 /mnt/data ext4 defaults,nofail,discard 0 0' | sudo tee -a /etc/fstab


Today I did:

mdadm --add /dev/md0 /dev/sdg /dev/sdh

sudo mdadm --grow /dev/md0 --level=6


This started a reshape process, which I could observe with
watch -n 1 cat /proc/mdstat
and md0 was still usable all day.
To keep file access fast, I paused the grow/reshape process today
at about 50% by issuing

echo "frozen" > /sys/block/md0/md/sync_action


After the file access was done, I restarted the
process with

echo reshape > /sys/block/md0/md/sync_action

After looking into this problem, I figured out that this is how the
problem (corrupted data) was triggered in the first place, while the
kernel log message "md: cannot handle concurrent replacement and reshape"
is not fatal.

"echo reshape" will restart the whole reshape from the beginning, while
the recorded reshape position should be used instead. This is a serious
kernel bug, I'll try to fix it soon.

By the way, "echo idle" should avoid this problem.
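
For example, a pause/resume sequence that keeps the recorded position would
look roughly like this (a sketch based on my test above; I'm assuming the
reshape_position attribute is exposed on this kernel):

echo frozen > /sys/block/md0/md/sync_action   # pause the reshape
cat /sys/block/md0/md/reshape_position        # position the reshape should resume from
echo idle > /sys/block/md0/md/sync_action     # resume from the recorded position
cat /proc/mdstat                              # reshape continues instead of restarting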

Thanks,
Kuai

but I saw in mdstat that it started from scratch.
After about 5 min I noticed that the /dev/md0 mount was gone, with
an input/output error in syslog, and I rebooted the computer to see whether the
kernel would reassemble md0 correctly. Maybe this was a problem,
because md0 was still reshaping, I do not know..
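
For what it is worth, the member superblocks should still record how far the
reshape had gotten before the reboot; a rough way to inspect them (a sketch
only, the /dev/sd[a-h] glob is a placeholder for whatever the member disks
are currently named, and the field names may differ between metadata versions):

for d in /dev/sd[a-h]; do
    # raid level, reshape position and event count recorded on each member
    mdadm --examine "$d" | grep -iE 'raid level|reshape|events'
done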


