Re: [PATCH] vfs: avoid spurious dentry ref/unref cycle on open

Mateusz Guzik <mjguzik@xxxxxxxxx> · Tue, 6 Aug 2024 18:09:43 +0200

On Tue, Aug 6, 2024 at 5:53 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Aug 06, 2024 at 04:46:28PM +0200, Mateusz Guzik wrote:
>
> > The flag thing is optional and can be dropped, but I think the general
> > direction should be to add *more* asserts and whatnot (even if they are
> > to land separately). A debug-only variant would not hurt.
>
> Asserts do *not* clarify anything; if you want your optional flag,
> come up with clear description of its semantics.  In terms of state,
> _not_ control flow.
>

It is supposed to indicate that both nd->path.mnt and nd->path.dentry
are no longer usable and must not even be looked at. Ideally code
which *does* look at them despite the flag (== there is a bug) traps.

However, I did not find a handy macro or anything of the sort to
"poison" these pointers. Instead I found tons of NULL checks all over,
including in lookup clean up.

So as is I agree the flag adds next to nothing as is, but the intent
was to catch any use of the above pointers after what they point to
got consumed. Perhaps I should just drop the flag for the time being
and only propose it with a more fleshed out scheme later.

> > @@ -3683,6 +3685,7 @@ static const char *open_last_lookups(struct nameidata *nd,
> >  static int do_open(struct nameidata *nd,
> >                  struct file *file, const struct open_flags *op)
> >  {
> > +     struct vfsmount *mnt;
> >       struct mnt_idmap *idmap;
> >       int open_flag = op->open_flag;
> >       bool do_truncate;
> > @@ -3720,11 +3723,22 @@ static int do_open(struct nameidata *nd,
> >               error = mnt_want_write(nd->path.mnt);
> >               if (error)
> >                       return error;
> > +             /*
> > +              * We grab an additional reference here because vfs_open_consume()
> > +              * may error out and free the mount from under us, while we need
> > +              * to undo write access below.
> > +              */
> > +             mnt = mntget(nd->path.mnt);
>
> It's "after vfs_open_consume() we no longer own the reference in nd->path.mnt",
> error or no error...
>

ok

> >               do_truncate = true;
>
>
> >       }
> >       error = may_open(idmap, &nd->path, acc_mode, open_flag);
> > -     if (!error && !(file->f_mode & FMODE_OPENED))
> > -             error = vfs_open(&nd->path, file);
> > +     if (!error && !(file->f_mode & FMODE_OPENED)) {
> > +             BUG_ON(nd->state & ND_PATH_CONSUMED);
> > +             error = vfs_open_consume(&nd->path, file);
> > +             nd->state |= ND_PATH_CONSUMED;
> > +             nd->path.mnt = NULL;
> > +             nd->path.dentry = NULL;
>
> Umm...  The thing that feels wrong here is that you get an extra
> asymmetry with ->atomic_open() ;-/  We obviously can't do that
> kind of thing there (if nothing else, we have the parent directory's
> inode to unlock, error or no error).
>
> I don't hate that patch, but it really feels like the calling
> conventions are not right.  Let me try to tweak it a bit and
> see if anything falls out...

Should you write your own patch I'm happy to give it a spin to
validate the win is about the same.

I forgot to note some stuff when sending the patch, so here it is:
perf is highly unstable between re-runs of the benchmark because
struct inode itself is not aligned to anything and at least ext4 plops
it in in a rather arbitrary place and uses a kmem cache without any
magic alignment guarantees. This results in false sharing showing up
and disappearing depending on how (un)lucky one gets. For reported
results I picked the worst case.

I had to patch one sucker for constantly getting in:
https://lore.kernel.org/linux-security-module/20240806133607.869394-1-mjguzik@xxxxxxxxx/T/#u

-- 
Mateusz Guzik <mjguzik gmail.com>