Re: [PATCH] vfs: avoid spurious dentry ref/unref cycle on open

Mateusz Guzik <mjguzik@xxxxxxxxx> · Tue, 6 Aug 2024 18:14:25 +0200

On Tue, Aug 6, 2024 at 6:09 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
> >
> > > The flag thing is optional and can be dropped, but I think the general
> > > direction should be to add *more* asserts and whatnot (even if they are
> > > to land separately). A debug-only variant would not hurt.
> >
> > Asserts do *not* clarify anything; if you want your optional flag,
> > come up with clear description of its semantics.  In terms of state,
> > _not_ control flow.
> >
>
> It is supposed to indicate that both nd->path.mnt and nd->path.dentry
> are no longer usable and must not even be looked at. Ideally code
> which *does* look at them despite the flag (== there is a bug) traps.
>
> However, I did not find a handy macro or anything of the sort to
> "poison" these pointers. Instead I found tons of NULL checks all over,
> including in lookup clean up.
>
> So as is I agree the flag adds next to nothing as is, but the intent
> was to catch any use of the above pointers after what they point to
> got consumed. Perhaps I should just drop the flag for the time being
> and only propose it with a more fleshed out scheme later.
>
> > > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> > >  static int do_open(struct nameidata *nd,
> > >                  struct file *file, const struct open_flags *op)
> > >  {
> > > +     struct vfsmount *mnt;
> > >       struct mnt_idmap *idmap;
> > >       int open_flag = op->open_flag;
> > >       bool do_truncate;
> > > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd,
> > >               error = mnt_want_write(nd->path.mnt);
> > >               if (error)
> > >                       return error;
> > > +             /*
> > > +              * We grab an additional reference here because vfs_open_consume()
> > > +              * may error out and free the mount from under us, while we need
> > > +              * to undo write access below.
> > > +              */
> > > +             mnt = mntget(nd->path.mnt);
> >
> > It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt",
> > error or no error...
> >
>
> ok
>
> > >               do_truncate = true;
> >
> >
> > >       }
> > >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > > -             error = vfs_open(&nd->path, file);
> > > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
> > > +             error = vfs_open_consume(&nd->path, file);
> > > +             nd->state |= ND_PATH_CONSUMED;
> > > +             nd->path.mnt = NULL;
> > > +             nd->path.dentry = NULL;
> >
> > Umm...  The thing that feels wrong here is that you get an extra
> > asymmetry with ->atomic_open() ;-/  We obviously can't do that
> > kind of thing there (if nothing else, we have the parent directory's
> > inode to unlock, error or no error).
> >
> > I don't hate that patch, but it really feels like the calling
> > conventions are not right.  Let me try to tweak it a bit and
> > see if anything falls out...
>

fwiw if you are looking to have vfs_open_${keyword} which accepts
nameidata and "consumes" it, I did not go for it because vfs_open is
in another file and I did not want to "leak" the struct there.

> Should you write your own patch I'm happy to give it a spin to
> validate the win is about the same.
>
> I forgot to note some stuff when sending the patch, so here it is:
> perf is highly unstable between re-runs of the benchmark because
> struct inode itself is not aligned to anything and at least ext4 plops
> it in in a rather arbitrary place and uses a kmem cache without any
> magic alignment guarantees. This results in false sharing showing up
> and disappearing depending on how (un)lucky one gets. For reported
> results I picked the worst case.
>
> I had to patch one sucker for constantly getting in:
> https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@xxxxxxxxx/T/#u
>
> --
> Mateusz Guzik <mjguzik gmail.com>

-- 
Mateusz Guzik <mjguzik gmail.com>