Test script (reproducible steps):
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i}; do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
    count=20; done

echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
sdg/sdh/sdi are 1GB iSCSI LUNs. The larger the shared disks are, the more likely the issue is to trigger.

There is a workaround: adding a --wait before the second --grow makes this bug disappear.

Running the script produces different results (output of: mdadm -D /dev/md0):

<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```

<case 2> : stale "--fail" data still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0       0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```

<case 3> : stale "--remove" data still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0       0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

Root cause:

In an md-cluster env, there is no guarantee that the reshape action (triggered by --grow) takes place on the current node. Any node in the cluster is able to start the resync action, and it may be triggered by another node's --grow cmd. md-cluster only uses resync_lockres to make sure that a single node does the resync job at a time.

The key related code (with my patch) is:
```
	if (mddev->sync_thread ||
	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
	    mddev->reshape_position != MaxSector)
		return -EBUSY;
```
Without the MD_RESYNCING_REMOTE test_bit, this 'if' only covers local recovery/resync events. In this bug, the resync is running on another node (let us call it node2). The initiator side (let us call it node1) starts the --grow cmd; it calls raid1_reshape and returns successfully (please note node1 does not do the resync job itself). But on node2 (which does the resync job), while handling the METADATA_UPDATED message (sent by node1), the related code flow is:
```
process_metadata_update
  md_reload_sb
    check_sb_changes
      update_raid_disks
```
update_raid_disks returns -EBUSY, but check_sb_changes does not handle the return value. So the reshape is never performed on node2, and in the end node2 writes stale data (e.g. rdev->raid_disks) into the disk metadata.

How to fix:

The simple & clear solution is to block the reshape action on the initiator side. When a node is executing --grow and detects an ongoing resync, it should immediately return and report an error to user space.
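For reference, the reason the initiator can observe the remote resync at all is the MD_RESYNCING_REMOTE bit: md-cluster sets it while another node reports an in-progress resync range and clears it once that node reports the range is finished. Below is a simplified sketch of that handling, loosely based on the RESYNCING message path (process_suspend_info) in drivers/md/md-cluster.c; it is not verbatim kernel code and details differ between kernel versions:
```
/*
 * Simplified sketch: how md-cluster tracks a resync that is running on
 * another node. Loosely based on the RESYNCING message handling
 * (process_suspend_info) in drivers/md/md-cluster.c; not verbatim code.
 */
static void process_suspend_info(struct mddev *mddev,
				 int slot, sector_t lo, sector_t hi)
{
	if (!hi) {
		/* The remote node finished its resync/recovery. */
		clear_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
		md_wakeup_thread(mddev->thread);
		return;
	}

	/*
	 * The remote node is resyncing [lo, hi]. Remember that, so local
	 * paths (e.g. update_raid_disks with this patch) can bail out
	 * with -EBUSY instead of starting a conflicting reshape.
	 */
	set_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
	/* ... record the suspended range reported by this slot ... */
}
```
With the patch, node1's second --grow fails with -EBUSY while node2 is still resyncing, instead of succeeding locally and later colliding with node2's ignored update_raid_disks call.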
Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
--
2.27.0