Re: [PATCH 1/2] md/cluster: reshape should returns error when remote doing resyncing job

Song Liu <song@xxxxxxxxxx> · Mon, 9 Nov 2020 10:01:29 -0800

On Sun, Nov 8, 2020 at 6:53 AM Zhao Heming <heming.zhao@xxxxxxxx> wrote:
>
[...]

> How to fix:
>
> The simple & clear solution is block the reshape action in initiator
> side. When node is executing "--grow" and detecting there is ongoing
> resyncing, it should immediately return & report error to user space.
>
> Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>

The code looks good to me, but please revise the commit log along the
lines of the following.

========================== 8< ==========================
md/cluster: block reshape requests with resync job initiated from remote node

In a cluster environment, a node can start a resync job even when the resync
command was executed on a different node. Reshape requests should be blocked
while a resync job initiated by any node is running. The current code only
<condition to block reshape requests>. This results in a deadlock in
<condition> (see repro steps below). Fix this by <adding the extra check>.

Repro steps:
...
========================== 8< ==========================

This way, whoever reads the commit log, which could be yourself in 2021,
will quickly understand the primary goal of this change.

Does this make sense?

Thanks,
Song

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7278,6 +7278,7 @@ static int update_raid_disks(struct mddev *mddev, int raid_disks)
>                 return -EINVAL;
>         if (mddev->sync_thread ||
>             test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +               test_bit(MD_RESYNCING_REMOTE, &mddev->recovery) ||
>             mddev->reshape_position != MaxSector)
>                 return -EBUSY;
>
> @@ -9662,8 +9663,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>                 }
>         }
>
> -       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks))
> -               update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +       if (mddev->raid_disks != le32_to_cpu(sb->raid_disks)) {
> +               ret = update_raid_disks(mddev, le32_to_cpu(sb->raid_disks));
> +               if (ret)
> +                       pr_warn("md: updating array disks failed. %d\n", ret);
> +       }
>
>         /*
>          * Since mddev->delta_disks has already updated in update_raid_disks,
> --
> 2.27.0
>
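
For anyone reading along, the busy check that the first hunk extends can be
modeled in user space roughly as below. This is only a sketch under stated
assumptions: struct mddev_model, the MODEL_* names, and the bit positions are
hypothetical stand-ins, not the kernel's definitions (the real flags and
MaxSector are defined in the md headers). It just shows that a reshape request
is refused with -EBUSY when either a local or a remote resync is flagged.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for this sketch only. */
enum {
	MODEL_RECOVERY_RUNNING = 0,	/* stands in for MD_RECOVERY_RUNNING */
	MODEL_RESYNCING_REMOTE = 1,	/* stands in for MD_RESYNCING_REMOTE */
};

/* Stand-in for MaxSector, meaning "no reshape in progress". */
#define MODEL_MAX_SECTOR UINT64_MAX

/* Hypothetical, heavily reduced model of the fields the check reads. */
struct mddev_model {
	bool has_sync_thread;		/* models mddev->sync_thread != NULL */
	unsigned long recovery;		/* models the mddev->recovery flag word */
	uint64_t reshape_position;	/* models mddev->reshape_position */
};

/* Minimal user-space replacement for the kernel's test_bit(). */
static bool model_test_bit(int nr, const unsigned long *word)
{
	return (*word >> nr) & 1UL;
}

/*
 * Mirrors the condition in the patched update_raid_disks():
 * returns -EBUSY if any local or remote sync activity is flagged,
 * or if a reshape is already in progress; returns 0 otherwise.
 */
int model_reshape_allowed(const struct mddev_model *m)
{
	if (m->has_sync_thread ||
	    model_test_bit(MODEL_RECOVERY_RUNNING, &m->recovery) ||
	    model_test_bit(MODEL_RESYNCING_REMOTE, &m->recovery) ||
	    m->reshape_position != MODEL_MAX_SECTOR)
		return -EBUSY;
	return 0;
}

/* Usage example: an idle array passes, a remote resync blocks. */
void model_self_check(void)
{
	struct mddev_model idle = { false, 0UL, MODEL_MAX_SECTOR };
	struct mddev_model remote = {
		false, 1UL << MODEL_RESYNCING_REMOTE, MODEL_MAX_SECTOR
	};

	assert(model_reshape_allowed(&idle) == 0);
	assert(model_reshape_allowed(&remote) == -EBUSY);
}
```

Without the remote-resync term, the second case in model_self_check() would
return 0, which is the situation the commit log should describe.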