On 6/24/24 10:37, Yu Kuai wrote:
> Hi,
>
> On 2024/06/24 9:55, Heming Zhao wrote:
>> Hello Song & Kuai,
>>
>> Xiao Ni told me that he has been quite busy recently and cannot
>> review the code. Do you have time to review my code?
>>
>> BTW, the patches have passed 60 loops of clustermd_tests [1].
>> Because the kernel md layer code changed, the clustermd_tests
>> scripts also need to be updated. I will send the clustermd_tests
>> patch once the kernel layer code passes review.
>
> The tests will be quite important, since I'm not familiar with the
> cluster code here. Of course I'll find some time to review the code.
>
> Thanks,
> Kuai
Thanks for your reply. I'm posting the HA env setup here. If you have
any questions, please feel free to ask me.

--------
# How to set up a clustered md env (I use openSUSE Tumbleweed to test)

1. apply the patches
   [PATCH 1/2] mdadm/clustermd_tests: add some APIs in func.sh to
    support running the tests without error
   [PATCH 2/2] mdadm/clustermd_tests: adjust test cases to support md
    module changes
   - https://lore.kernel.org/linux-raid/20240625021019.8732-1-heming.zhao@xxxxxxxx/T/#t

2. edit mdadm.git/clustermd_tests/cluster_conf

   2.1 edit NODE[12], e.g.:
   ```
   NODE1=192.168.1.100
   NODE2=192.168.1.101
   ```

   2.2 edit 'devlist=', e.g.:
   ```
   devlist=/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
   ```
   In my env, each of sd[abcdef] is 300MB.

3. set up the cluster env

   Download the ISO:
   https://mirrors.tuna.tsinghua.edu.cn/opensuse/tumbleweed/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240621-Media.iso

   Install two VMs and add 6 shared disks (300MB each), backed by raw
   files (see the sketch after step 5 for one way to create them).

4. set up HA

   4.1 install the software

   Tumbleweed doesn't support installing the HA stack via the
   "zypper -t pattern" mode, so install the packages directly:
       zypper in crmsh pacemaker corosync libcsync-plugin-sftp libvirt-client

   4.2 set up HA

   ref: https://documentation.suse.com/sle-ha/15-SP5/html/SLE-HA-all/article-installation.html

   Set the VM hostnames:
       node1: hostnamectl set-hostname tw-md1
       node2: hostnamectl set-hostname tw-md2

   Edit /etc/hosts:
       192.168.111.100 tw-md1
       192.168.111.101 tw-md2
   and copy it to the other node:
       scp -O /etc/hosts tw-md2:/etc/

   on node1: crm cluster init -S -y -i <node1-ip>
   on node2: crm cluster join -c <node1-ip>
   on node1: crm config edit   (<== edit the config as follows)
   ```
   INFO: "config" is accepted as "configure"
   node 1: tw-md1
   node 2: tw-md2
   primitive dlm ocf:pacemaker:controld \
           op monitor interval=60s timeout=60s \
           op stop timeout=100s interval=0s \
           op monitor interval=30s timeout=90s
   primitive stonith-libvirt stonith:external/libvirt \
           params hostlist="tb-md1,tb-md2" hypervisor_uri="qemu+tcp://<host-ip>/system" pcmk_delay_max=30s
   group base-group dlm
   clone base-clone base-group \
           meta interleave=true
   property cib-bootstrap-options: \
           have-watchdog=true \
           dc-version="2.1.7+20240530.09c4d6d2e-1.2-2.1.7+20240530.09c4d6d2e" \
           cluster-infrastructure=corosync \
           cluster-name=hacluster \
           stonith-timeout=71 \
           stonith-enabled=true \
           no-quorum-policy=freeze
   rsc_defaults build-resource-defaults: \
           resource-stickiness=1 \
           priority=1
   ```
   note:
   - in hypervisor_uri="qemu+tcp://<host-ip>/system", <host-ip> should
     be adjusted according to the real env.
   - use 'crm config show' to show/check the config.

   Check the HA status: crm status full   (<== should show no error, see below)
   ```
   Node List:
     * Node tw-md1: online:
       * Resources:
         * stonith-libvirt   (stonith:external/libvirt):   Started
         * dlm               (ocf:pacemaker:controld):     Started
     * Node tw-md2: online:
       * Resources:
         * dlm               (ocf:pacemaker:controld):     Started
   ```

5. run the tests

   all:
       ./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going
   single test:
       ./test --testdir=clustermd_tests --save-logs --logdir=./logs --keep-going --tests=02r1_Manage_add
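For step 3, one possible way to create and attach the shared disks from
the host (a minimal sketch; the image paths and the libvirt domain
names tw-md1/tw-md2 are assumptions matching the hostnames above, and
it assumes the guests' system disks are on virtio so sd[a-f] stay free
inside the guests):
```
# Hypothetical image paths and libvirt domain names; adjust to the
# real env. Creates 6 raw 300MB files and attaches each one to both
# VMs as a shareable SCSI disk with host caching disabled.
for i in a b c d e f; do
    img=/var/lib/libvirt/images/clustermd-sd$i.img
    qemu-img create -f raw "$img" 300M
    for vm in tw-md1 tw-md2; do
        virsh attach-disk "$vm" "$img" "sd$i" \
            --targetbus scsi --mode shareable --cache none --persistent
    done
done
```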
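And to reproduce the 60-loop run mentioned above, a simple wrapper
around the step-5 command could look like this (a sketch; it assumes
./test exits non-zero when a case fails):
```
# Run the full clustermd suite 60 times, keeping per-iteration logs;
# stop early if an iteration reports a failure.
for i in $(seq 1 60); do
    echo "=== clustermd_tests loop $i ==="
    ./test --testdir=clustermd_tests --save-logs \
           --logdir=./logs-$i --keep-going || break
done
```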
[1]: https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/clustermd_tests

Thanks,
Heming

On 6/12/24 10:19, Heming Zhao wrote:
> The commit 1bbe254e4336 ("md-cluster: check for timeout while a new
> disk adding") is correct in terms of code syntax, but it does not
> suit the real clustered code logic. When a timeout occurs while
> adding a new disk, if recv_daemon() bypasses the unlock for
> ack_lockres:CR, another node will be left waiting to grab the EX
> lock. This will cause the cluster to hang indefinitely.
>
> How to fix:
>
> 1. In dlm_lock_sync(), change the wait behaviour from waiting forever
>    to waiting with a timeout. This avoids the hang when another node
>    fails to handle a cluster msg. Another result of this change is
>    that if another node receives an unknown msg (e.g. a new
>    msg_type), the old code would hang, whereas the new code will time
>    out and fail. This could help cluster_md handle a new msg_type
>    from different nodes with different kernel/module versions (e.g.
>    the user only updates one leg's kernel and monitors the stability
>    of the new kernel).
>
> 2. The old __sendmsg() always returns 0 (success) by design (it must
>    successfully unlock ->message_lockres). This commit makes the
>    function return an error number when an error occurs.
>
> Fixes: 1bbe254e4336 ("md-cluster: check for timeout while a new disk adding")
> Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
> Reviewed-by: Su Yue <glass.su@xxxxxxxx>
> ---
>  drivers/md/md-cluster.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 8e36a0feec09..27eaaf9fef94 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -130,8 +130,13 @@ static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
>  			0, sync_ast, res, res->bast);
>  	if (ret)
>  		return ret;
> -	wait_event(res->sync_locking, res->sync_locking_done);
> +	ret = wait_event_timeout(res->sync_locking, res->sync_locking_done,
> +				 60 * HZ);
>  	res->sync_locking_done = false;
> +	if (!ret) {
> +		pr_err("locking DLM '%s' timeout!\n", res->name);
> +		return -EBUSY;
> +	}
>  	if (res->lksb.sb_status == 0)
>  		res->mode = mode;
>  	return res->lksb.sb_status;
> @@ -744,12 +749,14 @@ static void unlock_comm(struct md_cluster_info *cinfo)
>  static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  {
>  	int error;
> +	int ret = 0;
>  	int slot = cinfo->slot_number - 1;
>
>  	cmsg->slot = cpu_to_le32(slot);
>  	/*get EX on Message*/
>  	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX);
>  	if (error) {
> +		ret = error;
>  		pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error);
>  		goto failed_message;
>  	}
> @@ -759,6 +766,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  	/*down-convert EX to CW on Message*/
>  	error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CW);
>  	if (error) {
> +		ret = error;
>  		pr_err("md-cluster: failed to convert EX to CW on MESSAGE(%d)\n",
>  		       error);
>  		goto failed_ack;
> @@ -767,6 +775,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  	/*up-convert CR to EX on Ack*/
>  	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX);
>  	if (error) {
> +		ret = error;
>  		pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n",
>  		       error);
>  		goto failed_ack;
> @@ -775,6 +784,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  	/*down-convert EX to CR on Ack*/
>  	error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR);
>  	if (error) {
> +		ret = error;
>  		pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n",
>  		       error);
>  		goto failed_ack;
> @@ -789,7 +799,7 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  		goto failed_ack;
>  	}
>  failed_message:
> -	return error;
> +	return ret;
>  }
>
>  static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
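If it helps while testing: with this patch applied, a DLM lock request
that hits the new 60s timeout path shows up in the kernel log, so a
quick check after a test run could be (a sketch; the grep patterns just
match the pr_err strings in the patch above):
```
# Check the kernel log for the new dlm_lock_sync() timeout message or
# any of the __sendmsg() lock-conversion errors.
dmesg | grep -E "locking DLM .* timeout|md-cluster: failed to" \
    || echo "no DLM timeouts or md-cluster lock errors logged"
```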