Re: [RFC 1/1] fix NULL mnt [was Re: apparmor NULL pointer dereference on resume [efivarfs]]

Christian Brauner <brauner@xxxxxxxxxx> · Sun, 16 Mar 2025 07:46:31 +0100

On Sat, Mar 15, 2025 at 02:41:43PM -0400, James Bottomley wrote:
> On Sat, 2025-03-15 at 11:04 +0100, Christian Brauner wrote:
> [...]
> > Since efivars uses a single global superblock and we know that sfi-
> > >sb is still alive (After all we've just pinned it above.)
> > vfs_kern_mount() will reuse the same superblock.
> > 
> > There's two cases to consider:
> > 
> > (1) vfs_kern_mount() was successful. In this case path->mnt will hold
> > an active superblock reference that will be released asynchronously
> > via __fput(). That is safe and correct.
> > 
> > (2) vfs_kern_mount() fails. That's an issue because you need to call
> > deactivate_super() which will have a similar deadlock problem.
> > If efivarfs_pm_notify() now holds the last reference to the
> > superblock then deactivate_super() super will put that last
> > reference and call efivarfs_kill_super() which in turn will wait for
> > efivarfs_pm_notify() to finish. => deadlock
> > 
> > So in the error case you need to offload the call to
> > deactivate_super() to a workqueue.
> 
> OK, got that (although it did make my head explode a bit ... this is
> certainly subtle stuff).  To do the delayed work for the
> deactivate_super(), I hijacked the superblock destroy_work structure
> which I think is safe because by the time the work structure is
> executed, we own it and so it can be reused for the actual destroy_work
> in deactivate_super().
> 
> However, there's another problem: the mntput after kernel_file_open may
> synchronously call cleanup_mnt() (and thus deactivate_super()) if the
> open fails because it's marked MNT_INTERNAL, which is caused by
> SB_KERNMOUNT.  I fixed this just by not passing the SB_KERNMOUNT flag,
> which feels a bit hacky.

It actually isn't. We know that vfs_kern_mount() will always resurface
the single superblock that's exposed to userspace because we've just
taken a reference to it earlier in efivarfs_pm_notify(). So that
SB_KERNMOUNT flag is ignored because no new superblock is allocated. It
would only matter if we'd end up allocating a new superblock which we
never do.

And if we did it would be a bug because the superblock we allocate could
be reused at any time if a userspace task mounts efivarfs before
efivarfs_pm_notify() has destroyed it (or the respective workqueue).
But that superblock would then have SB_KERNMOUNT for something that's
not supposed to be one.

And whether or not that helper mount has MNT_INTERNAL is immaterial to
what you're doing here afaict.

So not passing the SB_KERNMOUNT flag is the right thing (see devtmpfs as
well). You could slap a comment in here explaining that we never
allocate a new superblock so it's clear to people not familiar with this
particular code.

> 
> I've put together everything at the bottom, however, I can't test the
> error legs of this because trying to trigger and unmount and hybernate
> at exactly the right point is pretty much impossible.  The rest seems
> to work as advertised, although I would like a tested-by from the
> apparmour people because I do run apparmour in my debian test rig but
> don't see the problem.
> 
> Regards,
> 
> James
> 
> ---
> 
> diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
> index 6eae8cf655c1..2d826e98066b 100644
> --- a/fs/efivarfs/super.c
> +++ b/fs/efivarfs/super.c
> @@ -474,12 +474,25 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
>  	return err;
>  }
>  
> +static void efivarfs_deactivate_super_work(struct work_struct *work)
> +{
> +	struct super_block *s = container_of(work, struct super_block,
> +					     destroy_work);
> +	/*
> +	 * note: here s->destroy_work is free for reuse (which
> +	 * will happen in deactivate_super)
> +	 */
> +	deactivate_super(s);
> +}
> +
> +static struct file_system_type efivarfs_type;
> +
>  static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
>  			      void *ptr)
>  {
>  	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
>  						    pm_nb);
> -	struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, };
> +	struct path path;
>  	struct efivarfs_ctx ectx = {
>  		.ctx = {
>  			.actor	= efivarfs_actor,
> @@ -487,6 +500,7 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
>  		.sb = sfi->sb,
>  	};
>  	struct file *file;
> +	struct super_block *s = sfi->sb;
>  	static bool rescan_done = true;
>  
>  	if (action == PM_HIBERNATION_PREPARE) {
> @@ -499,11 +513,39 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
>  	if (rescan_done)
>  		return NOTIFY_DONE;
>  
> +	/* ensure single superblock is alive and pin it */
> +	if (!atomic_inc_not_zero(&s->s_active))
> +		return NOTIFY_DONE;
> +
>  	pr_info("efivarfs: resyncing variable state\n");
>  
> -	/* O_NOATIME is required to prevent oops on NULL mnt */
> +	path.dentry = sfi->sb->s_root;
> +
> +	/* do not add SB_KERNMOUNT which causes MNT_INTERNAL, see below */
> +	path.mnt = vfs_kern_mount(&efivarfs_type, 0,
> +				  efivarfs_type.name, NULL);
> +	if (IS_ERR(path.mnt)) {
> +		pr_err("efivarfs: internal mount failed\n");
> +		/*
> +		 * We may be the last pinner of the superblock but
> +		 * calling efivarfs_kill_sb from within the notifier
> +		 * here would deadlock trying to unregister it
> +		 */
> +		INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work);
> +		schedule_work(&s->destroy_work);
> +		return PTR_ERR(path.mnt);
> +	}
> +
> +	/* path.mnt now has pin on superblock, so this must be above one */
> +	atomic_dec(&s->s_active);
> +
>  	file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
>  				current_cred());
> +	/*
> +	 * safe even if last put because no MNT_INTERNAL means this
> +	 * will do delayed deactivate_super and not deadlock
> +	 */
> +	mntput(path.mnt);
>  	if (IS_ERR(file))
>  		return NOTIFY_DONE;
>  
> 
>