commit d3374825ce57 ("md: make devices disappear when they are no longer needed.") introduced protection between mddev creating & removing. The md_open shouldn't create mddev when all_mddevs list doesn't contain mddev. With currently code logic, there will be very easy to trigger soft lockup in non-preempt env. *** env *** kvm-qemu VM 2C1G with 2 iscsi luns kernel should be non-preempt *** script *** about trigger every time with below script ``` 1 node1="mdcluster1" 2 node2="mdcluster2" 3 4 mdadm -Ss 5 ssh ${node2} "mdadm -Ss" 6 wipefs -a /dev/sda /dev/sdb 7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \ /dev/sdb --assume-clean 8 9 for i in {1..10}; do 10 echo ==== $i ====; 11 12 echo "test ...." 13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" 14 sleep 1 15 16 echo "clean ....." 17 ssh ${node2} "mdadm -Ss" 18 done ``` I use mdcluster env to trigger soft lockup, but it isn't mdcluster speical bug. To stop md array in mdcluster env will do more jobs than non-cluster array, which will leave enough time/gap to allow kernel to run md_open. *** stack *** ``` [ 884.226509] mddev_put+0x1c/0xe0 [md_mod] [ 884.226515] md_open+0x3c/0xe0 [md_mod] [ 884.226518] __blkdev_get+0x30d/0x710 [ 884.226520] ? bd_acquire+0xd0/0xd0 [ 884.226522] blkdev_get+0x14/0x30 [ 884.226524] do_dentry_open+0x204/0x3a0 [ 884.226531] path_openat+0x2fc/0x1520 [ 884.226534] ? seq_printf+0x4e/0x70 [ 884.226536] do_filp_open+0x9b/0x110 [ 884.226542] ? md_release+0x20/0x20 [md_mod] [ 884.226543] ? seq_read+0x1d8/0x3e0 [ 884.226545] ? kmem_cache_alloc+0x18a/0x270 [ 884.226547] ? do_sys_open+0x1bd/0x260 [ 884.226548] do_sys_open+0x1bd/0x260 [ 884.226551] do_syscall_64+0x5b/0x1e0 [ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` *** rootcause *** "mdadm -A" (or other array assemble commands) will start a daemon "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop action will wakeup "mdadm --monitor". The "--monitor" daemon will immediately get info from /proc/mdstat. This time mddev in kernel still exist, so /proc/mdstat still show md device, which makes "mdadm --monitor" to open /dev/md0. The previously "mdadm -Ss" is removing action, the "mdadm --monitor" open action will trigger md_open which is creating action. Racing is happening. ``` <thread 1>: "mdadm -Ss" md_release mddev_put deletes mddev from all_mddevs queue_work for mddev_delayed_delete at this time, "/dev/md0" is still available for opening <thread 2>: "mdadm --monitor ..." md_open + mddev_find can't find mddev of /dev/md0, and create a new mddev and | return. + trigger "if (mddev->gendisk != bdev->bd_disk)" and return -ERESTARTSYS. ``` In non-preempt kernel, <thread 2> is occupying on current CPU. and mddev_delayed_delete which was created in <thread 1> also can't be schedule. In preempt kernel, it can also trigger above racing. But kernel doesn't allow one thread running on a CPU all the time. after <thread 2> running some time, the later "mdadm -A" (refer above script line 13) will call md_alloc to alloc a new gendisk for mddev. it will break md_open statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller, the soft lockup is broken. Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx> --- v2: - add missed code in mddev_find - modify commit comment - remove SUSE product name in test script - soft lockup trigger time is incorrect, fix it. - only leave one soft lockup stack v1: - create patch --- drivers/md/md.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 21da0c48f6c2..fe6d9f6e3c3b 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -734,7 +734,7 @@ void mddev_init(struct mddev *mddev) } EXPORT_SYMBOL_GPL(mddev_init); -static struct mddev *mddev_find(dev_t unit) +static struct mddev *mddev_find(dev_t unit, bool create) { struct mddev *mddev, *new = NULL; @@ -793,7 +793,8 @@ static struct mddev *mddev_find(dev_t unit) } spin_unlock(&all_mddevs_lock); - new = kzalloc(sizeof(*new), GFP_KERNEL); + if (create) + new = kzalloc(sizeof(*new), GFP_KERNEL); if (!new) return NULL; @@ -5644,7 +5645,7 @@ static int md_alloc(dev_t dev, char *name) * writing to /sys/module/md_mod/parameters/new_array. */ static DEFINE_MUTEX(disks_mutex); - struct mddev *mddev = mddev_find(dev); + struct mddev *mddev = mddev_find(dev, true); struct gendisk *disk; int partitioned; int shift; @@ -6523,7 +6524,7 @@ static void autorun_devices(int part) } md_probe(dev); - mddev = mddev_find(dev); + mddev = mddev_find(dev, true); if (!mddev || !mddev->gendisk) { if (mddev) mddev_put(mddev); @@ -7807,7 +7808,7 @@ static int md_open(struct block_device *bdev, fmode_t mode) * Succeed if we can lock the mddev, which confirms that * it isn't being stopped right now. */ - struct mddev *mddev = mddev_find(bdev->bd_dev); + struct mddev *mddev = mddev_find(bdev->bd_dev, false); int err; if (!mddev) -- 2.30.0