In a cluster env, a node can start a resync job even when the resync cmd was
executed on a different node. Reshape requests should be blocked while a resync
job initiated by any node is running, but the current code only checks the
local resync condition before blocking a reshape request. This leaves the array
in a chaotic state when the resync is running on a different node. Fix this by
additionally checking MD_RESYNCING_REMOTE.

*** Repro steps ***

Test env

node A & B share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
The larger the disk size, the more likely the issue is to trigger
(a longer resync leaves a larger window to hit the issue).

Test script
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
mdadm --zero-superblock /dev/sd{g,h,i}
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
 count=20; done
echo "mdadm create array"
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
echo "set up array on node2"
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
#mdadm --wait /dev/md0
mdadm --grow --raid-devices=2 /dev/md0
```
There is a workaround: adding --wait (the commented-out line above) before the
second --grow makes the issue disappear.

Test results

Every cmd in the test script finishes, but the array status is wrong. There
are 3 possible results. (partial output of: mdadm -D /dev/md0)

<case 1> : normal test result.
```
    Number   Major   Minor   RaidDevice State
       1       8      112        0      active sync   /dev/sdh
       2       8      128        1      active sync   /dev/sdi
```
<case 2> : "--fail" data still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0       0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi

       0       8       96        -      faulty   /dev/sdg
```
<case 3> : "--remove" data still exists in the disk metadata area.
```
    Number   Major   Minor   RaidDevice State
       -       0       0        0      removed
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
```

*** Analysis ***

Code flow of the problem scenario:
```
 node A                                node B
 ------------------------------       ------------------------------
 mdadm --grow -n 3 md0
 + raid1_reshape
   mddev->raid_disks: 2=>3
                                       start resync job, it will
                                       block A's resync job
                                       mddev->raid_disks: 2=>3

 mdadm md0 --fail sdg
 + update disk: array sb & bitmap sb
 + send METADATA_UPDATED
   (resync job blocking)
                                       (B continues the "--grow -n 3"
                                        resync job)
                                       recv METADATA_UPDATED
                                       + read disk metadata
                                       + raid1_error
                                       + set MD_RECOVERY_INTR to
                                         break resync
                                       ... ...
                                       md_check_recovery
                                       + remove_and_add_spares
                                         returns 1
                                       + set MD_RECOVERY_RECOVER,
                                         later restart resync

 mdadm md0 --remove sdg
 + md_cluster_ops->remove_disk
 | + send REMOVE
 + md_kick_rdev_from_array
 + update disk: array sb & bitmap sb
   (resync job blocking)
                                       (B continues the "--grow -n 3"
                                        resync job)
                                       recv REMOVE
                                       + process_remove_disk doesn't
                                         set mddev->sb_flags, so it
                                         doesn't update the disk sb &
                                         bitmap sb.
                                       ......
                                       md_check_recovery
                                       + md_kick_rdev_from_array

 mdadm --grow -n 2 md0
 + raid1_reshape
 | mddev->raid_disks: 3=>2
 + send METADATA_UPDATED
                                       (B continues the "--grow -n 3"
                                        resync job)
                                       recv METADATA_UPDATED
                                       + check_sb_changes
                                         update_raid_disks returns
                                         -EBUSY, so the update of
                                         mddev->raid_disks: 3=>2 fails
                                       (B never successfully updates
                                        mddev->raid_disks: 3=>2)

                                       when B finishes the "--grow -n 3"
                                       resync job
                                       + use mddev->raid_disks: 3 to
                                         update array sb & bitmap sb
                                       + send METADATA_UPDATED
 recv METADATA_UPDATED
 + read wrong raid_disks to update
   kernel data.
```
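For reference, the MD_RESYNCING_REMOTE bit checked by this patch is maintained
in the md-cluster receive path: a peer broadcasts RESYNCING messages for the
range it is resyncing, and the bit stays set until the peer reports completion.
Below is a condensed sketch, paraphrased from my reading of
drivers/md/md-cluster.c (unrelated work in the function is elided; this is not
part of the diff):
```
/*
 * Condensed paraphrase of process_suspend_info() in md-cluster.c.
 * A RESYNCING message from a peer carries the [lo, hi] range it is
 * currently resyncing; hi == 0 signals that the remote job is done.
 */
static void process_suspend_info(struct mddev *mddev,
				 int slot, sector_t lo, sector_t hi)
{
	if (!hi) {
		/* remote resync finished: clear the REMOTE flag so
		 * update_raid_disks() stops returning -EBUSY
		 */
		clear_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
		/* ... remove suspend info, wake md thread (elided) ... */
		return;
	}

	/* a remote node is resyncing [lo, hi]; with this patch,
	 * update_raid_disks() refuses reshape while this bit is set
	 */
	set_bit(MD_RESYNCING_REMOTE, &mddev->recovery);
	/* ... merge peer bitmap, record suspended range (elided) ... */
}
```
While the bit is set, both a local reshape request and one relayed via
check_sb_changes() fail with -EBUSY instead of racing with the remote resync.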
Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>
---
v2:
- for clarity, split patch v1 into two separate patches for review.
- revise patch subject & comments.
- add test results in the comments.
v1:
- add cover letter
- add more description in the commit log
v0:
- create 2 patches; patch 1/2 is this patch.
---
 drivers/md/md.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..74280e353b8f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
 		return -EINVAL;
 	if (mddev->sync_thread ||
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+	    test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
 	    mddev->reshape_position != MaxSector)
 		return -EBUSY;
 
@@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 		}
 	}
 
-	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
-		update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+	if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
+		ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
+		if (ret)
+			pr_warn("md: updating array disks failed. %d\n", ret);
+	}
 
 	/*
 	 * Since mddev->delta_disks has already updated in update_raid_disks,
-- 
2.27.0