Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs

Mariusz Tkaczyk <mariusz.tkaczyk@xxxxxxxxxxxxxxx> · Thu, 31 Aug 2023 08:50:57 +0200

On Thu, 31 Aug 2023 10:28:44 +0800
Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

> Hi, Acelan,
> 
> 在 2023/08/28 21:50, Yu Kuai 写道:
> > Hi,
> > 
> > 在 2023/08/28 13:20, Song Liu 写道:  
> >> Hi AceLan,
> >>
> >> Thanks for running the experiments.
> >>
> >> On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao <acelan@xxxxxxxxx> wrote:
> >> [...]  
> >>>>
> >>>> Could you please run the follow two experiments?
> >>>>
> >>>> 1. Confirm 12a6caf273240a triggers this. Specifically:
> >>>>     git checkout 12a6caf273240a => repros
> >>>>     git checkout 12a6caf273240a~1 => cannot repro  
> >>> Yes, I'm pretty sure about this, that's my bisect result and I just
> >>> confirmed it again.
> >>> I also tried reverting 12a6caf273240a and the issue is gone.  
> >>
> >> The log doesn't match my guess. Specifically:
> >>
> >> [  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
> >> [  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
> >> [  420.080787] systemd-shutdown[1]: Failed to sync MD block device
> >> /dev/md123, ignoring: Input/output error
> >> [  420.090831] md: md123 stopped.
> >> [  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
> >> [  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
> >> Device or resource busy  
> > 
> > I see that:
> > 
> > systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.
> > 
> > Can we make sure is this why power off hang?
> > 
> > Because in my VM, power off is not hang and I got:
> > 
> > systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
> > systemd-shutdown[1]: Failed to finalize MD devices, ignoring.  
> >>
> >> For a successful stop on md123, we reach the pr_info() in md_open().
> >> For a failed stop on md122, the kernel returns -EBUSY before that
> >> pr_info() in md_open(). There are some changes in md_open() in
> >> the past few release, so I am not quite sure we are looking at the
> >> same code.  
> > 
> > By the way, based on code review, looks like md_open never return
> > -EBUSY, and I think following is the only place can return -EBUSY before
> > md_open() is called:
> > 
> > blkdev_open
> >   blkdev_get_by_dev
> >    bd_prepare_to_claim
> >     bd_may_claim     -> -EBUSY
> > 
> > Acelan, can you add following debug patch on the top of Song's patch
> > and reporduce it again? Hope it'll confirm why stop array failed with
> > -EBUSY.
> >   
> 
> I saw your reply:
> 
> https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@xxxxxxxxxxxxxx/
> 
> But I didn't recieve you emails, please reply to me through this email
> address: yukuai3@xxxxxxxxxx. yukuai1@xxxxxxxxxxxxxxx is just used for
> sending, and I always missing emails from huaweicloud.com, my apologize
> for this inconvenience, I have to do it this way due to my company
> policy.
> 
> There is something interesting on test result:
> 
> At first, md122 is the root device, I guess this is why I can't repoduce
> this problem in my local VM.
> [   14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device 
> /dev/md122.
> [   14.481823] block md122: the capability attribute has been deprecated.
> 
> Then, before trying to stop md122, following log shows that md122 is
> still mounted:
> [ 1368.369926] [3462]: Remounting '/' read-only with options 
> 'errors=remount-ro,stripe=128'.
> [ 1368.396287] EXT4-fs (md122p3): re-mounted 
> 6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none.
> 
> And since then, there is no log aboug md_open(), which means no one
> ever open md122 since this remount.
> 
> At last, there are lots of log shows that md122 can't be opened
> exclusively:
> [ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16
> [ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16
>
> Which indicate that md122 is still mounted, and systemd-shutdow can't
> stop this array. This behaviour looks correct to me, because rootfs
> can't be umounted while systemd-shutdown is still running.
> 
> However, there are some questions I don't have answers for now, I'm not
> familiar with how systemd works:
> 
> 1) In the normal case that raid is root device, how can systemd-shutdown
> make progress and how is rootfs unmounted?

Thanks for digging into!

In that case systemd-shutdown should skip this device. It should be umounted by
dracut. systemd-shutdown implements naive way to try stop MD devices, simple
ioctl(fd, STOP_ARRAY, NULL).
We are expecting failure if resource is still mounted.

Here dracut code:
https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/md-shutdown.sh

Ohh, it is ubuntu so it is handled by initramfs-tools not by dracut, but I hope
you will find it useful for the future. 

and system-shutdown:
https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L483C7-L483C7

When the functionality in systemd-shutdown was fixed by me, I asked if we can
make it configurable- Lennart disagreed so there is now way to disable it now. 

> 2) How does this scenario related to commit 12a6caf273240a.

I'm unable to reproduce it on Redhat, Ubuntu is on the way now.
I'm starting to be convinced that it could be related with initramfs-tools but
I cannot prove this theory yet. I need to reproduce the issue first.

Thanks,
Mariusz
> 
> I must read systemd source code to get answers, and it'll be appreciated
> if someone has answers.
> 
> Thanks,
> Kuai
> 
> > diff --git a/block/bdev.c b/block/bdev.c
> > index 979e28a46b98..699739223dcb 100644
> > --- a/block/bdev.c
> > +++ b/block/bdev.c
> > @@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev, 
> > blk_mode_t mode, void *holder,
> >          if (holder) {
> >                  mode |= BLK_OPEN_EXCL;
> >                  ret = bd_prepare_to_claim(bdev, holder, hops);
> > -               if (ret)
> > +               if (ret) {
> > +                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
> > +                               disk->disk_name, current->comm, ret);
> >                          goto put_blkdev;
> > +               }
> >          } else {
> >                  if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
> >                          ret = -EIO;
> > diff --git a/block/fops.c b/block/fops.c
> > index eaa98a987213..2d69119c71f6 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct 
> > file *filp)
> > 
> >          bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
> >                                   filp->private_data, NULL);
> > -       if (IS_ERR(bdev))
> > +       if (IS_ERR(bdev)) {
> > +               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
> > +                       filp, current->comm, PTR_ERR(bdev));
> >                  return PTR_ERR(bdev);
> > +       }
> > 
> >          if (bdev_nowait(bdev))
> >                  filp->f_mode |= FMODE_NOWAIT;
> > 
> > Thanks,
> > Kuai
> >   
> >>
> >> Therefore, could you please help clarify:
> >>
> >> 1. Which base kernel are you using?
> >>  
> >>> From the log, you are using 6.5-rc7-706a74159504. However,  
> >> I think we cannot cleanly revert 12a6caf273240a on top of
> >> 6.5-rc7-706a74159504. Did you manually fix some issue in the
> >> revert? If so, could you please share the revert commit?
> >>
> >> 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
> >> one are you using?
> >>
> >> Thanks,
> >> Song
> >>  
> >>>  
> >>>>
> >>>> 2. Try with the following change (add debug messages), which hopefully
> >>>>     shows which command is holding a reference on mddev->openers.
> >>>>
> >>>> Thanks,
> >>>> Song
> >>>>
> >>>> diff --git i/drivers/md/md.c w/drivers/md/md.c
> >>>> index 78be7811a89f..3e9b718b32c1 100644
> >>>> --- i/drivers/md/md.c
> >>>> +++ w/drivers/md/md.c
> >>>> @@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,
> >>>> blk_mode_t mode,
> >>>>                  if (mddev->pers && atomic_read(&mddev->openers) > 1) {
> >>>>                          mutex_unlock(&mddev->open_mutex);
> >>>>                          err = -EBUSY;
> >>>> +                       pr_warn("%s return -EBUSY for %s with
> >>>> mddev->openers = %d\n",
> >>>> +                               __func__, mdname(mddev),
> >>>> atomic_read(&mddev->openers));
> >>>>                          goto out;
> >>>>                  }
> >>>>                  if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
> >>>>                          mutex_unlock(&mddev->open_mutex);
> >>>>                          err = -EBUSY;
> >>>> +                       pr_warn("%s return -EBUSY for %s with
> >>>> MD_CLOSING bit set\n",
> >>>> +                               __func__, mdname(mddev));
> >>>>                          goto out;
> >>>>                  }
> >>>>                  did_set_md_closing = true;
> >>>> @@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk, 
> >>>> blk_mode_t mode)
> >>>>                  goto out_unlock;
> >>>>
> >>>>          atomic_inc(&mddev->openers);
> >>>> +       pr_info("%s:%s openers++ = %d by %s\n", __func__, 
> >>>> mdname(mddev),
> >>>> +               atomic_read(&mddev->openers), current->comm);
> >>>>          mutex_unlock(&mddev->open_mutex);
> >>>>
> >>>>          disk_check_media_change(disk);
> >>>> @@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)
> >>>>
> >>>>          BUG_ON(!mddev);
> >>>>          atomic_dec(&mddev->openers);
> >>>> +       pr_info("%s:%s openers-- = %d by %s\n", __func__, 
> >>>> mdname(mddev),
> >>>> +               atomic_read(&mddev->openers), current->comm);
> >>>>          mddev_put(mddev);
> >>>>   }  
> >>> It's pretty strange that I can't reproduce the issue after applied 
> >>> the patch.
> >>>
> >>> I tried to figure out which part affect the issue and found when I
> >>> comment out the pr_info() In md_release(), the issue could be
> >>> reproduced.
> >>>
> >>> -- 
> >>> Chia-Lin Kao(AceLan)
> >>> http://blog.acelan.idv.tw/
> >>> E-Mail: acelan.kaoATcanonical.com (s/AT/@/)  
> >> .
> >>  
> > 
> > .
> >   
>