On 4/8/21 2:45 PM, heming.zhao@xxxxxxxx wrote:
On 4/8/21 2:33 PM, Paul Menzel wrote:
Dear Heming,
Am 08.04.21 um 07:52 schrieb heming.zhao@xxxxxxxx:
On 4/8/21 1:09 PM, Paul Menzel wrote:
Am 08.04.21 um 05:01 schrieb Heming Zhao:
md_kick_rdev_from_array() will remove the rdev from the list, so we
should use rdev_for_each_safe() to traverse the list.
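To make the iteration issue concrete, below is a minimal sketch of the pattern (illustrative only, not the actual diff; the helper name and the Faulty check are assumptions): md_kick_rdev_from_array() unlinks the rdev from mddev->disks, so a plain rdev_for_each() would step through a just-removed entry, while rdev_for_each_safe() caches the next entry in a temporary first.
```
/*
 * Minimal sketch, assuming the macros and helpers declared in
 * drivers/md/md.h.  kick_faulty_rdevs() and the Faulty test are only
 * illustrative; the real call sites are in the md/md-cluster code.
 */
#include "md.h"		/* struct mddev, struct md_rdev, rdev_for_each_safe() */

static void kick_faulty_rdevs(struct mddev *mddev)
{
	struct md_rdev *rdev, *tmp;

	/*
	 * rdev_for_each_safe() remembers the next list entry in 'tmp'
	 * before the body runs, so md_kick_rdev_from_array() can unlink
	 * 'rdev' (and possibly free it) without breaking the traversal.
	 * A plain rdev_for_each() would advance through the removed entry.
	 */
	rdev_for_each_safe(rdev, tmp, mddev) {
		if (test_bit(Faulty, &rdev->flags))
			md_kick_rdev_from_array(rdev);
	}
}
```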
How to trigger:
```
for i in {1..20}; do
    echo ==== $i `date` ====;
    mdadm -Ss && ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb
    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l1 /dev/sda \
        /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"
    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```
In the test script, I do not understand what node2, which you log in to over SSH, is used for.
The bug can only be triggered in a cluster environment. There are two nodes in the cluster;
the script runs on node1 and needs ssh to node2 to execute some commands.
${node2} stands for node2's IP address, e.g.: ssh 192.168.0.3 "mdadm --wait ..."
Please excuse my ignorance. I guess some other component is needed to connect the two RAID devices on each node? At least you never tell mdadm directly to use *node2*. From reading *Cluster Multi-device (Cluster MD)* [1], it seems a resource agent is needed.
... ...
[1]: https://documentation.suse.com/sle-ha/12-SP4/html/SLE-HA-all/cha-ha-cluster-md.html
Your reference is right. This bug is cluster-specific, and I also mentioned "md-cluster" in the patch subject.
In my opinion, people interested in this patch should already have cluster md knowledge,
and I think the above script contains enough info to show the reproduction steps.
Hello,
I will add more info about the test script in the v2 patch.