Re: [RFC PATCH v2 8/9] namespace: add sb_revalidate_bindmounts helper

Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> · Wed, 5 Apr 2023 20:45:10 +0200

On Mon, Apr 3, 2023 at 11:14 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> > Useful if for some reason bindmounts root dentries get invalidated
> > but it's needed to revalidate existing bindmounts without remounting.
> >
> > Cc: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> > Cc: Amir Goldstein <amir73il@xxxxxxxxx>
> > Cc: Stéphane Graber <stgraber@xxxxxxxxxx>
> > Cc: Seth Forshee <sforshee@xxxxxxxxxx>
> > Cc: Christian Brauner <brauner@xxxxxxxxxx>
> > Cc: Andrei Vagin <avagin@xxxxxxxxx>
> > Cc: Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx>
> > Cc: Bernd Schubert <bschubert@xxxxxxx>
> > Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > Cc: criu@xxxxxxxxxx
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx>
> > ---
> >  fs/namespace.c                | 90 +++++++++++++++++++++++++++++++++++
> >  include/linux/mnt_namespace.h |  3 ++
> >  2 files changed, 93 insertions(+)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index bc0f15257b49..b74d00d6abb0 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
> >       return ret;
> >  }
> >
> > +struct bind_mount_list_item {
> > +     struct list_head list;
> > +     struct vfsmount *mnt;
> > +};
> > +
> > +/*
> > + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> > + *
> > + * Useful if for some reason bindmount root dentries get invalidated
> > + * but it's needed to revalidate existing bindmounts without remounting.
> > + */
> > +int sb_revalidate_bindmounts(struct super_block *sb)

Hi Christian!

>
> It's difficult to not strongly dislike this patchset on the basis of the
> need for a function like this alone. And I just have to say it even if I
> normally try not to do this: This whole function is bonkers in my opinion.

This function is the main pain point in this patchset, I believe :)
I thought a lot about other ways (without this helper) and all of them
have their pros and cons.

And that's the reason why this patchset is RFC. I wanted to hear from
you and other folks
about your point of view on the problem. As far as I can see there is
some interest to that problem,
and implementation question is important, but if there is no
objections against the concept of
FUSE mounts healing (and Checkpoint/Restore too as a more complex
case) then I think we will
be able to agree on some approach to the problem. I'm ready to rework
that in the way it is supposed to be done.

>
> But leaving that aside for a second. This really needs detailed
> explanations on locking, assumptions, and an explanation what you're
> doing here. This looks crazy to me and definitely isn't fit to be a
> generic helper in this form.

Of course.

>
> > +{
> > +     struct mount *mnt;
> > +     struct bind_mount_list_item *bmnt, *next;
> > +     int err = 0;
> > +     struct vfsmount *root_mnt = NULL;
> > +     LIST_HEAD(mnt_to_update);
> > +     char *buf;
> > +
> > +     buf = (char *) __get_free_page(GFP_KERNEL);
> > +     if (!buf)
> > +             return -ENOMEM;
> > +
> > +     lock_mount_hash();
> > +     list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> > +             /* we only want to touch bindmounts */
> > +             if (mnt->mnt.mnt_root == sb->s_root) {
> > +                     if (!root_mnt)
> > +                             root_mnt = mntget(&mnt->mnt);
> > +
> > +                     continue;
> > +             }
> > +
> > +             bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
>
> Allocating under lock_mount_hash() even if doable with this flag
> combination should be avoided at all costs imho. That just seems hacky.

got it

>
> > +             if (!bmnt) {
> > +                     err = -ENOMEM;
> > +                     goto exit;
>
> You're exiting the function with lock_mount_hash() held...

+

>
> > +             }
> > +
> > +             bmnt->mnt = mntget(&mnt->mnt);
> > +             list_add_tail(&bmnt->list, &mnt_to_update);
> > +     }
> > +     unlock_mount_hash();
>
> You've dropped unlock_mount_hash() and the function doesn't hold
> namespace_lock() and isn't documented as requiring the caller to hold
> it. And the later patch that uses this doesn't afaict (although I
> haven't looked at any of the fuse specific stuff). So this is open to
> all sorts of races with mount and unmount now. This needs an explanation
> why that doesn't matter.

yes, it doesn't matter, my idea was that this helper should
"actualize" roots for the bind mounts
at the point of call, but it seems no problem if we skip some new
mounts, because
This helper can be called again. I didn't want to take a namespace_lock() there
because we don't want to have such guarantees for this interface.

Of course, this has to be documented. Thanks!

>
> > +
> > +     /* TODO: get rid of this limitation */
>
> Confused about this comment.

Sorry.

This comment means that current implementation relies on the fact that
each fuse superblock has some
root mount in the list (mnt->mnt.mnt_root == sb->s_root). And that's
not always the case, because user
can do something like this:
fusermount ... /mnt/my_fuse_mount
mount --bind /mnt/my_fuse_mount/subtree /mnt/my_fuse_mount_subtree
umount /mnt/my_fuse_mount

In this case, root_mnt == NULL. But we need to have root_mnt, to pass
it in vfs_path_lookup(..) later.

>
> > +     if (!root_mnt) {
> > +             err = -ENOENT;
> > +             goto exit;
> > +     }
> > +
> > +     list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
>
> No one else can access that list so list_for_each_entry_safe() seems
> pointless.

+

>
> > +             struct vfsmount *cur_mnt = bmnt->mnt;
> > +             struct path path;
> > +             struct dentry *old_root;
> > +             char *p;
> > +
> > +             p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> > +             if (IS_ERR(p))
> > +                     goto exit;
> > +
> > +             /* TODO: are these lookup flags fully safe and correct? */
> > +             err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> > +                             p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> > +             if (err)
> > +                     goto exit;
> > +
> > +             /* replace bindmount root dentry */
> > +             lock_mount_hash();
> > +             old_root = cur_mnt->mnt_root;
> > +             cur_mnt->mnt_root = dget(path.dentry);
>
> mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
> it has been set and a lot of code just assumes that it's stable.
>
> Top of my hat, since you're not holding namespace_lock() mount
> propagation can be going on concurrently so propagate_one() can check if
> (!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
> updating it. A lot of code could actually be touching mnt->mnt_root
> while you're updating it.
>
> There's probably a lot more issues with this I'm just not seeing without
> spending more time on this. But NAK on this as it stands. Sorry.

Thanks for looking into it, Christian!

Kind regards,
Alex

>
> > +             dput(old_root);
> > +             unlock_mount_hash();
> > +
> > +             path_put(&path);
> > +     }
> > +
> > +exit:
> > +     free_page((unsigned long) buf);
> > +     mntput(root_mnt);
> > +     list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> > +             list_del(&bmnt->list);
> > +             mntput(bmnt->mnt);
> > +             kfree(bmnt);
> > +     }
> > +
> > +     return err;
> > +}
> > +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);