Dear Heming,
On 08.04.21 at 07:52, heming.zhao@xxxxxxxx wrote:
On 4/8/21 1:09 PM, Paul Menzel wrote:
On 08.04.21 at 05:01, Heming Zhao wrote:
md_kick_rdev_from_array() removes rdev from the list, so we should
use rdev_for_each_safe() to iterate over the list.
How to trigger:
```
for i in {1..20}; do
echo ==== $i `date` ====;
mdadm -Ss && ssh ${node2} "mdadm -Ss"
wipefs -a /dev/sda /dev/sdb
mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
/dev/sdb --assume-clean
ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
mdadm --wait /dev/md0
ssh ${node2} "mdadm --wait /dev/md0"
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
sleep 1
done
```
In the test script, I do not understand what node2 is used for, the
host you log in to over SSH.
The bug can only be triggered in a cluster environment. There are two
nodes in the cluster; the script runs on node1 and needs SSH access to
node2 to execute some commands.
${node2} stands for node2's IP address, e.g.: ssh 192.168.0.3 "mdadm
--wait ..."
Please excuse my ignorance. I guess some other component is needed to
connect the two RAID devices on each node? At least you never tell mdadm
directly to use *node2*. Reading *Cluster Multi-device (Cluster MD)* [1],
it seems a resource agent is needed.
... ...
Signed-off-by: Heming Zhao <heming.zhao@xxxxxxxx>
Reviewed-by: Gang He <ghe@xxxxxxxx>
If there is a commit, your patch is fixing, please add a Fixes: tag.
OK, I forgot it, will send v2 patch later.
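For reference, the kernel convention for a Fixes: tag is the first 12 characters of the offending commit's SHA plus its subject line; the hash and subject below are placeholders, not the actual commit being fixed:

```
Fixes: 123456789abc ("md-cluster: subject of the commit being fixed")
```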
Awesome.
Kind regards,
Paul
[1]:
https://documentation.suse.com/sle-ha/12-SP4/html/SLE-HA-all/cha-ha-cluster-md.html