On Thu, 7 Sep 2023 20:53:30 +0800 Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

> Hi,
> 
> On 2023/09/07 20:41, Mariusz Tkaczyk wrote:
> > On Thu, 7 Sep 2023 20:14:03 +0800
> > Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > 
> >> Hi,
> >>
> >> On 2023/09/07 19:26, Yu Kuai wrote:
> >>> Hi,
> >>>
> >>> On 2023/09/07 18:18, Mariusz Tkaczyk wrote:
> >>>> On Thu, 7 Sep 2023 10:04:11 +0800
> >>>> Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> On 2023/09/06 18:27, Mariusz Tkaczyk wrote:
> >>>>>> On Wed, 6 Sep 2023 14:26:30 +0800
> >>>>>> AceLan Kao <acelan@xxxxxxxxx> wrote:
> >>>>>>> From previous testing, I don't think it's an issue in systemd, so I
> >>>>>>> did a simple test and found that the issue is gone.
> >>>>>>> You only need to add a small delay in md_release(), then the issue
> >>>>>>> can't be reproduced.
> >>>>>>>
> >>>>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>>>>> index 78be7811a89f..ef47e34c1af5 100644
> >>>>>>> --- a/drivers/md/md.c
> >>>>>>> +++ b/drivers/md/md.c
> >>>>>>> @@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)
> >>>>>>>  {
> >>>>>>>  	struct mddev *mddev = disk->private_data;
> >>>>>>>  
> >>>>>>> +	msleep(10);
> >>>>>>>  	BUG_ON(!mddev);
> >>>>>>>  	atomic_dec(&mddev->openers);
> >>>>>>>  	mddev_put(mddev);
> >>>>>>
> >>>>>> I have a repro and I tested it on my setup. It is not working for me.
> >>>>>> My setup could be more "advanced", to maximize the chance of
> >>>>>> reproduction:
> >>>>>>
> >>>>>> # cat /proc/mdstat
> >>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
> >>>>>> md121 : active raid0 nvme2n1[1] nvme5n1[0]
> >>>>>>       7126394880 blocks super external:/md127/0 128k chunks
> >>>>>>
> >>>>>> md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>>>       104857600 blocks super external:/md126/0 64K chunks 2 near-copies [4/4] [UUUU]
> >>>>>>
> >>>>>> md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
> >>>>>>       2655765504 blocks super external:/md126/1 level 5, 32k chunk, algorithm 0 [4/4] [UUUU]
> >>>>>>
> >>>>>> md124 : active raid1 nvme0n1[1] nvme3n1[0]
> >>>>>>       99614720 blocks super external:/md125/0 [2/2] [UU]
> >>>>>>
> >>>>>> md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
> >>>>>>       10402 blocks super external:imsm
> >>>>>>
> >>>>>> md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S) nvme4n1[0](S)
> >>>>>>       20043 blocks super external:imsm
> >>>>>>
> >>>>>> md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
> >>>>>>       10402 blocks super external:imsm
> >>>>>>
> >>>>>> I have an almost 99% repro ratio; slowly moving forward...
> >>>>>>
> >>>>>> It is an endless loop because systemd-shutdown sends the STOP_ARRAY
> >>>>>> ioctl, which is successful, but the array is not stopped. For that
> >>>>>> reason it sets "changed = true".
> >>>>>
> >>>>> How does systemd-shutdown judge whether the array is stopped? cat
> >>>>> /proc/mdstat, ls /dev/md*, or some other way?
> >>>>
> >>>> Hi Yu,
> >>>>
> >>>> It trusts the return result; I confirmed that 0 is returned.
> >>>> The weirdest part is that we are returning 0 but the array is still
> >>>> there, and it is stopped again in the next systemd loop. I don't
> >>>> understand why yet...
> >>>>
> >>>>>> systemd-shutdown sees the change and retries to check if there is
> >>>>>> something else which can be stopped now, and again, and again...
> >>>>>>
> >>>>>> I will check what is returned first; it could be 0 or it could be a
> >>>>>> positive errno (nit?) because systemd only cares about "if (r < 0)".
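
For reference, the operation under discussion boils down to something like the
sketch below. This is not systemd's actual code, just a minimal throwaway test
of the STOP_ARRAY pattern; the open flags and error handling are placeholders
only:

/* stop-array.c: minimal sketch of the STOP_ARRAY pattern discussed above.
 * NOT systemd's code; open flags and error handling are illustration only.
 * Build with: gcc -o stop-array stop-array.c
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>	/* MD_MAJOR, used by the md ioctl numbers */
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/md123";

	int fd = open(dev, O_RDONLY | O_EXCL);
	if (fd < 0) {
		fprintf(stderr, "open %s: %s\n", dev, strerror(errno));
		return 1;
	}

	/* A caller that only treats "r < 0" as failure will consider this a
	 * success even if the array is still visible afterwards, which is
	 * the retry loop described in this thread. */
	int r = ioctl(fd, STOP_ARRAY, NULL);
	if (r < 0)
		fprintf(stderr, "ioctl(STOP_ARRAY) on %s: %s\n", dev, strerror(errno));
	else
		printf("ioctl(STOP_ARRAY) on %s returned %d\n", dev, r);

	close(fd);
	return r < 0 ? 1 : 0;
}

Running something like this against /dev/md123 and then re-reading
/proc/mdstat makes it easy to check whether a 0 return really corresponds to
the array going away.
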
> >>>>>
> >>>>> I did notice that there are lots of log messages about md123 being
> >>>>> stopped:
> >>>>>
> >>>>> [ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>>>> [ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>>>> [ 1371.846845] md: md123 stopped.
> >>>>> [ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
> >>>>> [ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
> >>>>> [ 1371.862941] md: md123 stopped.
> >>>>>
> >>>>> And md_ioctl->do_md_stop doesn't have an error path after printing this
> >>>>> log, hence 0 will be returned to the user.
> >>>>>
> >>>>> The normal case is:
> >>>>>
> >>>>> open md123
> >>>>> ioctl STOP_ARRAY -> all rdevs should be removed from the array
> >>>>> close md123 -> mddev will finally be freed by:
> >>>>>  md_release
> >>>>>   mddev_put
> >>>>>    set_bit(MD_DELETED, &mddev->flags) -> user should not see this mddev
> >>>>>    queue_work(md_misc_wq, &mddev->del_work)
> >>>>>
> >>>>>  mddev_delayed_delete
> >>>>>   kobject_put(&mddev->kobj)
> >>>>>
> >>>>>  md_kobj_release
> >>>>>   del_gendisk
> >>>>>    md_free_disk
> >>>>>     mddev_free
> >>>>>
> >>>> OK, thanks. I understand that md_release is called when the descriptor
> >>>> is closed, right?
> >>>>
> >>>
> >>> Yes, normally closing md123 should drop that last reference.
> >>>>
> >>>>> Now that you can reproduce this problem 99% of the time, can you dig
> >>>>> deeper and find out what is wrong?
> >>>>
> >>>> Yes, working on it!
> >>>>
> >>>> My first idea was that mddev_get and mddev_put are missing on the
> >>>> md_ioctl() path, but it doesn't help with the issue. My motivation here
> >>>> was that md_attr_store and md_attr_show are using them.
> >>>>
> >>>> systemd regenerates the list of MD arrays on every loop and it is always
> >>>> there; systemd is able to open a file descriptor (maybe inactive?).
> >>>
> >>> md123 should not be opened again; ioctl(STOP_ARRAY) already set the
> >>> 'MD_CLOSING' flag to prevent that. Are you sure that systemd-shutdown
> >>> does open and close the array in each loop?
> >>
> >> I realized that I'm wrong here. 'MD_CLOSING' is cleared before the ioctl
> >> returns by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
> >> called md_set_readonly or do_md_stop").
> >>
> >> I'm confused here: the commit message says 'MD_CLOSING' should not be set
> >> for the STOP_ARRAY_RO case, but I don't understand why it's cleared for
> >> STOP_ARRAY as well.
> >>
> 
> Can you try the following patch?
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 3afd57622a0b..31b9cec7e4c0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>  			err = -EBUSY;
>  			goto out;
>  		}
> -		did_set_md_closing = true;
> +		if (cmd == STOP_ARRAY_RO)
> +			did_set_md_closing = true;
>  		mutex_unlock(&mddev->open_mutex);
>  		sync_blockdev(bdev);
>  	}
> 
> I think preventing the array from being opened again after STOP_ARRAY might
> fix this.

It didn't help. I tried much more:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0fe7ab6e8ab9..807387f37755 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -608,7 +608,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
 {
 	lockdep_assert_held(&all_mddevs_lock);
 
-	if (test_bit(MD_DELETED, &mddev->flags))
+	if (test_bit(MD_DELETED, &mddev->flags) || test_bit(MD_CLOSING, &mddev->flags))
 		return NULL;
 	atomic_inc(&mddev->active);
 	return mddev;

The issue is still reproducible.
I was shocked, so I removed the clearing of MD_CLOSING:

 out:
-	if(did_set_md_closing)
-		clear_bit(MD_CLOSING, &mddev->flags);
+	//if(did_set_md_closing)
+	//	clear_bit(MD_CLOSING, &mddev->flags);

Still reproducible. Then I found this:

@@ -6177,7 +6179,7 @@ static void md_clean(struct mddev *mddev)
 	mddev->persistent = 0;
 	mddev->level = LEVEL_NONE;
 	mddev->clevel[0] = 0;
-	mddev->flags = 0;
+	//mddev->flags = 0;
 	mddev->sb_flags = 0;
 	mddev->ro = MD_RDWR;
 	mddev->metadata_type[0] = 0;

The issue is not reproducible, and I can see in the log that the attempt to
stop the device is not repeated.

Well, with the change of using a time-sensitive lock in mddev_put() we need to
ensure that we can still remove the mddev, so setting and preserving
MD_CLOSING is reasonable to me. I will combine this into patches tomorrow.

Yu, I really appreciate your help! Glad to have you here.

Thanks,
Mariusz
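
P.S. Just to make the direction concrete: one possible shape for the
md_clean() part is sketched below. This is only an illustration of the idea,
not the final patch, and the exact mask macro is my assumption:

static void md_clean(struct mddev *mddev)
{
	...
	mddev->persistent = 0;
	mddev->level = LEVEL_NONE;
	mddev->clevel[0] = 0;
	/*
	 * Keep MD_CLOSING (set by the STOP_ARRAY ioctl) so the array cannot
	 * be opened again before the mddev is finally freed; every other
	 * flag is still dropped as before.
	 */
	mddev->flags &= BIT_ULL_MASK(MD_CLOSING);
	mddev->sb_flags = 0;
	mddev->ro = MD_RDWR;
	mddev->metadata_type[0] = 0;
	...
}

Preserving that one bit instead of zeroing the whole word keeps new opens of
the stopped array failing until the mddev is really gone, which is the
behaviour the experiment above points at.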