Re: [PATCH] md: don't create mddev in md_open

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 30, 2021 at 12:43 AM Zhao Heming <heming.zhao@xxxxxxxx> wrote:
>
> commit d3374825ce57 ("md: make devices disappear when they are no longer
> needed.") introduced protection between mddev creating & removing. The
> md_open shouldn't create mddev when all_mddevs list doesn't contain
> mddev. With currently code logic, there will be very easy to trigger
> soft lockup in non-preempt env.
>
> *** env ***
> kvm-qemu VM 2C1G with 2 iscsi luns
> kernel should be non-preempt
>
> *** script ***
>
> about trigger 1 time with 10 tests
>
> ```
> 1  node1="15sp3-mdcluster1"
> 2  node2="15sp3-mdcluster2"
> 3
> 4  mdadm -Ss
> 5  ssh ${node2} "mdadm -Ss"
> 6  wipefs -a /dev/sda /dev/sdb
> 7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
>    /dev/sdb --assume-clean
> 8
> 9  for i in {1..100}; do
> 10    echo ==== $i ====;
> 11
> 12    echo "test  ...."
> 13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
> 14    sleep 1
> 15
> 16    echo "clean  ....."
> 17    ssh ${node2} "mdadm -Ss"
> 18 done
> ```
>
> I use mdcluster env to trigger soft lockup, but it isn't mdcluster
> speical bug. To stop md array in mdcluster env will do more jobs than
> non-cluster array, which will leave enough time/gap to allow kernel to
> run md_open.
>
> *** stack ***
>
> ```
> PID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
>  #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
>  #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
>  #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
>  #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
>  #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
>  #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
>  #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
>  #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
>  #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
>  #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb
>     ffffffffb900008c
>
> or:
> [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
> [  884.226515]  md_open+0x3c/0xe0 [md_mod]
> [  884.226518]  __blkdev_get+0x30d/0x710
> [  884.226520]  ? bd_acquire+0xd0/0xd0
> [  884.226522]  blkdev_get+0x14/0x30
> [  884.226524]  do_dentry_open+0x204/0x3a0
> [  884.226531]  path_openat+0x2fc/0x1520
> [  884.226534]  ? seq_printf+0x4e/0x70
> [  884.226536]  do_filp_open+0x9b/0x110
> [  884.226542]  ? md_release+0x20/0x20 [md_mod]
> [  884.226543]  ? seq_read+0x1d8/0x3e0
> [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
> [  884.226547]  ? do_sys_open+0x1bd/0x260
> [  884.226548]  do_sys_open+0x1bd/0x260
> [  884.226551]  do_syscall_64+0x5b/0x1e0
> [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> ```
>
> *** rootcause ***
>
> "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
> --monitor" by default. When "mdadm -Ss" is running, the stop action will
> wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
> info from /proc/mdstat. This time mddev in kernel still exist, so
> /proc/mdstat still show md device, which makes "mdadm --monitor" to open
> /dev/md0.
>
> The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
> open action will trigger md_open which is creating action. Racing is
> happening.
>
> ```
> <thread 1>: "mdadm -Ss"
> md_release
>   mddev_put deletes mddev from all_mddevs
>   queue_work for mddev_delayed_delete
>   at this time, "/dev/md0" is still available for opening
>
> <thread 2>: "mdadm --monitor ..."
> md_open
>  + mddev_find can't find mddev of /dev/md0, and create a new mddev and
>  |    return.
>  + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
>       -ERESTARTSYS.
> ```
>
> In non-preempt kernel, <thread 2> is occupying on current CPU. and
> mddev_delayed_delete which was created in <thread 1> also can't be
> schedule.
>
> In preempt kernel, it can also trigger above racing. But kernel doesn't
> allow one thread running on a CPU all the time. after <thread 2> running
> some time, the later "mdadm -A" (refer above script line 13) will call
> md_alloc to alloc a new gendisk for mddev. it will break md_open
> statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
> the soft lockup is broken.
>
> Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>
> ---
>  drivers/md/md.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 21da0c48f6c2..730d8570ad6d 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -734,7 +734,7 @@ void mddev_init(struct mddev *mddev)
>  }
>  EXPORT_SYMBOL_GPL(mddev_init);
>
> -static struct mddev *mddev_find(dev_t unit)
> +static struct mddev *mddev_find(dev_t unit, bool create)

"create" is added but never used? Wrong version?

Thanks,
Song


>  {
>         struct mddev *mddev, *new = NULL;
>
> @@ -5644,7 +5644,7 @@ static int md_alloc(dev_t dev, char *name)
>          * writing to /sys/module/md_mod/parameters/new_array.
>          */
>         static DEFINE_MUTEX(disks_mutex);
> -       struct mddev *mddev = mddev_find(dev);
> +       struct mddev *mddev = mddev_find(dev, true);
>         struct gendisk *disk;
>         int partitioned;
>         int shift;
> @@ -6523,7 +6523,7 @@ static void autorun_devices(int part)
>                 }
>
>                 md_probe(dev);
> -               mddev = mddev_find(dev);
> +               mddev = mddev_find(dev, true);
>                 if (!mddev || !mddev->gendisk) {
>                         if (mddev)
>                                 mddev_put(mddev);
> @@ -7807,7 +7807,7 @@ static int md_open(struct block_device *bdev, fmode_t mode)
>          * Succeed if we can lock the mddev, which confirms that
>          * it isn't being stopped right now.
>          */
> -       struct mddev *mddev = mddev_find(bdev->bd_dev);
> +       struct mddev *mddev = mddev_find(bdev->bd_dev, false);
>         int err;
>
>         if (!mddev)
> --
> 2.30.0
>



[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux