Re: [PATCH v3 1/2] md/cluster: reshape should returns error when remote doing resyncing job

Song Liu <song@xxxxxxxxxx> · Mon, 16 Nov 2020 01:05:09 -0800

On Sat, Nov 14, 2020 at 8:30 PM Zhao Heming <heming.zhao@xxxxxxxx> wrote:
>
[...]
>
> Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>

The fix makes sense to me. But I really hope we can improve the commit log.
I have made some changes to it with a couple TODOs for you (see below).
Please read it, fill the TODOs, and revise 2/2.

Thanks,
Song

md/cluster: block reshape with remote resync job

Reshape request should be blocked with ongoing resync job. In cluster
env, a node can start resync job even if the resync cmd isn't executed
on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
will start resync job. However, current update_raid_disks() only check
local recovery status, which is incomplete. As a result, we see (TODO
describe observed issue).

Fix this issue by blocking reshape request. When node executes "--grow"
and detects ongoing resync, it should stop and report error to user.

The following script reproduces the issue with (TODO:  ???%) probability.
```
# on node1, node2 is the remote node.
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>

> ---
>  drivers/md/md.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f304ae..74280e353b8f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
[...]