On Sat, Mar 15, 2025 at 02:41:43PM -0400, James Bottomley wrote: > On Sat, 2025-03-15 at 11:04 +0100, Christian Brauner wrote: > [...] > > Since efivars uses a single global superblock and we know that sfi- > > >sb is still alive (After all we've just pinned it above.) > > vfs_kern_mount() will reuse the same superblock. > > > > There's two cases to consider: > > > > (1) vfs_kern_mount() was successful. In this case path->mnt will hold > > an active superblock reference that will be released asynchronously > > via __fput(). That is safe and correct. > > > > (2) vfs_kern_mount() fails. That's an issue because you need to call > > deactivate_super() which will have a similar deadlock problem. > > If efivarfs_pm_notify() now holds the last reference to the > > superblock then deactivate_super() super will put that last > > reference and call efivarfs_kill_super() which in turn will wait for > > efivarfs_pm_notify() to finish. => deadlock > > > > So in the error case you need to offload the call to > > deactivate_super() to a workqueue. > > OK, got that (although it did make my head explode a bit ... this is > certainly subtle stuff). To do the delayed work for the > deactivate_super(), I hijacked the superblock destroy_work structure > which I think is safe because by the time the work structure is > executed, we own it and so it can be reused for the actual destroy_work > in deactivate_super(). > > However, there's another problem: the mntput after kernel_file_open may > synchronously call cleanup_mnt() (and thus deactivate_super()) if the > open fails because it's marked MNT_INTERNAL, which is caused by > SB_KERNMOUNT. I fixed this just by not passing the SB_KERNMOUNT flag, > which feels a bit hacky. It actually isn't. We know that vfs_kern_mount() will always resurface the single superblock that's exposed to userspace because we've just taken a reference to it earlier in efivarfs_pm_notify(). So that SB_KERNMOUNT flag is ignored because no new superblock is allocated. It would only matter if we'd end up allocating a new superblock which we never do. And if we did it would be a bug because the superblock we allocate could be reused at any time if a userspace task mounts efivarfs before efivarfs_pm_notify() has destroyed it (or the respective workqueue). But that superblock would then have SB_KERNMOUNT for something that's not supposed to be one. And whether or not that helper mount has MNT_INTERNAL is immaterial to what you're doing here afaict. So not passing the SB_KERNMOUNT flag is the right thing (see devtmpfs as well). You could slap a comment in here explaining that we never allocate a new superblock so it's clear to people not familiar with this particular code. > > I've put together everything at the bottom, however, I can't test the > error legs of this because trying to trigger and unmount and hybernate > at exactly the right point is pretty much impossible. The rest seems > to work as advertised, although I would like a tested-by from the > apparmour people because I do run apparmour in my debian test rig but > don't see the problem. > > Regards, > > James > > --- > > diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c > index 6eae8cf655c1..2d826e98066b 100644 > --- a/fs/efivarfs/super.c > +++ b/fs/efivarfs/super.c > @@ -474,12 +474,25 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor, > return err; > } > > +static void efivarfs_deactivate_super_work(struct work_struct *work) > +{ > + struct super_block *s = container_of(work, struct super_block, > + destroy_work); > + /* > + * note: here s->destroy_work is free for reuse (which > + * will happen in deactivate_super) > + */ > + deactivate_super(s); > +} > + > +static struct file_system_type efivarfs_type; > + > static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, > void *ptr) > { > struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info, > pm_nb); > - struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, }; > + struct path path; > struct efivarfs_ctx ectx = { > .ctx = { > .actor = efivarfs_actor, > @@ -487,6 +500,7 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, > .sb = sfi->sb, > }; > struct file *file; > + struct super_block *s = sfi->sb; > static bool rescan_done = true; > > if (action == PM_HIBERNATION_PREPARE) { > @@ -499,11 +513,39 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action, > if (rescan_done) > return NOTIFY_DONE; > > + /* ensure single superblock is alive and pin it */ > + if (!atomic_inc_not_zero(&s->s_active)) > + return NOTIFY_DONE; > + > pr_info("efivarfs: resyncing variable state\n"); > > - /* O_NOATIME is required to prevent oops on NULL mnt */ > + path.dentry = sfi->sb->s_root; > + > + /* do not add SB_KERNMOUNT which causes MNT_INTERNAL, see below */ > + path.mnt = vfs_kern_mount(&efivarfs_type, 0, > + efivarfs_type.name, NULL); > + if (IS_ERR(path.mnt)) { > + pr_err("efivarfs: internal mount failed\n"); > + /* > + * We may be the last pinner of the superblock but > + * calling efivarfs_kill_sb from within the notifier > + * here would deadlock trying to unregister it > + */ > + INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work); > + schedule_work(&s->destroy_work); > + return PTR_ERR(path.mnt); > + } > + > + /* path.mnt now has pin on superblock, so this must be above one */ > + atomic_dec(&s->s_active); > + > file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME, > current_cred()); > + /* > + * safe even if last put because no MNT_INTERNAL means this > + * will do delayed deactivate_super and not deadlock > + */ > + mntput(path.mnt); > if (IS_ERR(file)) > return NOTIFY_DONE; > > >