Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs

Mariusz Tkaczyk <mariusz.tkaczyk@xxxxxxxxxxxxxxx> · Thu, 7 Sep 2023 14:41:53 +0200

On Thu, 7 Sep 2023 20:14:03 +0800
Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

> Hi,
> 
> 在 2023/09/07 19:26, Yu Kuai 写道:
> > Hi,
> > 
> > 在 2023/09/07 18:18, Mariusz Tkaczyk 写道:  
> >> On Thu, 7 Sep 2023 10:04:11 +0800
> >> Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>  
> >>> Hi,
> >>>
> >>> 在 2023/09/06 18:27, Mariusz Tkaczyk 写道:  
> >>>> On Wed, 6 Sep 2023 14:26:30 +0800
> >>>> AceLan Kao <acelan@xxxxxxxxx> wrote:  
> >>>>>   From previous testing, I don't think it's an issue in systemd, so I
> >>>>> did a simple test and found the issue is gone.
> >>>>> You only need to add a small delay in md_release(), then the issue
> >>>>> can't be reproduced.
> >>>>>
> >>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>>> index 78be7811a89f..ef47e34c1af5 100644
> >>>>> --- a/drivers/md/md.c
> >>>>> +++ b/drivers/md/md.c
> >>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> >>>>> {
> >>>>>          struct mddev *mddev = disk->private_data;
> >>>>>
> >>>>> +       msleep(10);
> >>>>>          BUG_ON(!mddev);
> >>>>>          atomic_dec(&mddev->openers);
> >>>>>          mddev_put(mddev);  
> >>>>
> >>>> I have repro and I tested it on my setup. It is not working for me.
> >>>> My setup could be more "advanced" to maximalize chance of reproduction:
> >>>>
> >>>> # cat /proc/mdstat
> >>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> >>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> >>>>         7126394880 blocks super external:/md127/0 128k chunks
> >>>>
> >>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>         104857600 blocks super external:/md126/0 64K chunks 2 
> >>>> near-copies
> >>>> [4/4] [UUUU]
> >>>>
> >>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>         2655765504 blocks super external:/md126/1 level 5, 32k chunk,
> >>>> algorithm 0 [4/4] [UUUU]
> >>>>
> >>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> >>>>         99614720 blocks super external:/md125/0 [2/2] [UU]
> >>>>
> >>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> >>>>         10402 blocks super external:imsm
> >>>>
> >>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) 
> >>>> nvme4n1[0](S)
> >>>>         20043 blocks super external:imsm
> >>>>
> >>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> >>>>         10402 blocks super external:imsm
> >>>>
> >>>> I have almost 99% repro ratio, slowly moving forward..
> >>>>
> >>>> It is endless loop because systemd-shutdown sends ioctl "stop_array" 
> >>>> which
> >>>> is successful but array is not stopped. For that reason it sets 
> >>>> "changed =
> >>>> true".  
> >>>
> >>> How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
> >>> ls /dev/md* or other way?  
> >>
> >> Hi Yu,
> >>
> >> It trusts return result, I confirmed that 0 is returned.
> >> The most weird is we are returning 0 but array is still there, and it is
> >> stopped again in next systemd loop. I don't understand why yet..
> >>  
> >>>> Systemd-shutdown see the change and retries to check if there is 
> >>>> something
> >>>> else which can be stopped now, and again, again...
> >>>>
> >>>> I will check what is returned first, it could be 0 or it could be 
> >>>> positive
> >>>> errno (nit?) because systemd cares "if(r < 0)".  
> >>>
> >>> I do noticed that there are lots of log about md123 stopped:
> >>>
> >>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>> [ 1371.846845] md: md123 stopped.
> >>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>> [ 1371.862941] md: md123 stopped.
> >>>
> >>> And md_ioctl->do_md_stop doesn't have error path after printing this
> >>> log, hence 0 will be returned to user.
> >>>
> >>> The normal case is that:
> >>>
> >>> open md123
> >>> ioctl STOP_ARRAY -> all rdev should be removed from array
> >>> close md123 -> mddev will finally be freed by:
> >>>     md_release
> >>>      mddev_put
> >>>       set_bit(MD_DELETED, &mddev->flags) -> user shound not see this 
> >>> mddev
> >>>       queue_work(md_misc_wq, &mddev->del_work)
> >>>
> >>>     mddev_delayed_delete
> >>>      kobject_put(&mddev->kobj)
> >>>
> >>>     md_kobj_release
> >>>      del_gendisk
> >>>       md_free_disk
> >>>        mddev_free
> >>>  
> >> Ok thanks, I understand that md_release is called on descriptor 
> >> closing, right?
> >>  
> > 
> > Yes, normally close md123 should drop that last reference.  
> >>  
> >>> Now that you can reporduce this problem 99%, can you dig deeper and find
> >>> out what is wrong?  
> >>
> >> Yes, working on it!
> >>
> >> My first idea was that mddev_get and mddev_put are missing on 
> >> md_ioctl() path
> >> but it doesn't help for the issue. My motivation here was that 
> >> md_attr_store and
> >> md_attr_show are using them.
> >>
> >> Systemd regenerates list of MD arrays on every loop and it is always
> >> there, systemd is able to open file descriptor (maybe inactive?).  
> > 
> > md123 should not be opended again, ioctl(STOP_ARRAY) already set the
> > flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown do
> > open and close the array in each loop?  
> 
> I realized that I'm wrong here. 'MD_CLOSING' is cleared before ioctl
> return by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> called md_set_readonly or do_md_stop").
> 
> I'm confused here, commit message said 'MD_CLOSING' shold not be set for
> the case STOP_ARRAY_RO, but I don't understand why it's cleared for
> STOP_ARRAY as well.
> 
Not related with your findings but:

I replaced if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
because that is the way to exit without queuing work:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0fe7ab6e8ab9..80bd7446be94 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -618,8 +618,7 @@ static void mddev_delayed_delete(struct work_struct *ws);

 

void mddev_put(struct mddev *mddev)
{
-       if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
-               return;
+       spin_lock(&all_mddevs_lock);
        if (!mddev->raid_disks && list_empty(&mddev->disks) &&
            mddev->ctime == 0 && !mddev->hold_active) {
                /* Array is not configured at all, and not held active,
@@ -634,6 +633,7 @@ void mddev_put(struct mddev *mddev)
                INIT_WORK(&mddev->del_work, mddev_delayed_delete);
                queue_work(md_misc_wq, &mddev->del_work);
        }
+       atomic_dec(&mddev->active);
        spin_unlock(&all_mddevs_lock);
}

After that I got kernel panic but it seems that workqueue is scheduled:

 51.535103] BUG: kernel NULL pointer dereference, address: 0000000000000008
[   51.539115] ------------[ cut here ]------------
[   51.543867] #PF: supervisor read access in kernel mode
1;[3 9 m S5t1a.r5tPF: error_code(0x0000) - not-present page
[   51.543875] PGD 0 P4D 0
.k[ 0 m5.1
 54ops: 0000 [#1] PREEMPT SMP NOPTI
[   51.552207] refcount_t: underflow; use-after-free.
[   51.556820] CPU: 19 PID: 368 Comm: kworker/19:1 Not tainted 6.5.0+ #57
[   51.556825] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS
WLYDCRB1.SYS.0027.P82.2204080829 04/08/2022 [   51.561979] WARNING: CPU: 26
PID: 376 at lib/refcount.c:28 refcount_warn_saturate+0x99/0xe0 [   51.567273]
Workqueue:  mddev_delayed_delete [md_mod] [   51.569822] Modules linked in:
[   51.574351]  (events)
[   51.574353] RIP: 0010:process_one_work+0x10f/0x3d0
[   51.579155]  configfs

In my case, it seems to be IMSM container device is stopped in loop, which is an
inactive from the start. It is not something I'm totally sure but it could lead
us to the root cause. So far I know, the original reported uses IMSM arrays too.

Thanks,
Mariusz