Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 2 Jul 2024 11:41:44 -0700

On Tue, 2 Jul 2024 at 10:58, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> Suppose the rcu fast path lookup reads the dentry seqc, then does all
> the legitimize_mnt and other work. Everything, except modifying the
> lockref. The caller is given a mnt to put (per-cpu scalable), dentry
> seqc read before any of the path validation and an indication this is
> rcu.

Yes.

> Then after whatever is done if the seqc still matches this is the same
> as if there was lockref get/put around it.

So this is partly why I was thinking of a callback. That "check
sequence number afterwards" is still important. And if it's a
callback, it can be done in the path walking code, and it can go on
and say "oh, I'll need to redo this without RCU".

If it's a "we returned a dentry under RCU", suddenly the caller has to
know this about the name lookup and do the repeating by hand.

And as long as we don't expose it to modules and only use it for
"stat()" and friends, I'm ok with it, but I'm just saying that it's
all a bit scary.

> The only worry is pointers suddenly going NULL or similar as
> dentry/inode is looked at. To be worked out on per-syscall basis.

We have subtle rules wrt dentry->d_inode. It can indeed become NULL at
any time during the RCU walk, since what protects it is the d_lock and
the dentry count.

The inode itself is then RCU-free'd, so it will *exist*, but you can't
just blindly use dentry->d_inode itself while under RCU.

Which is why it's cached in 'struct nameidata', and we validate it
with nd->seq when it's loaded. And why things like may_lookup() use
nd->inode, not the dentry.

And that's another rule that we probably should aim to not have escape
from the path walking as an interface.

Because it's much too easy to do

        struct inode *inode = d_backing_inode(path->dentry);

but that's just wrong during the RCU path walk.

Again, having this be a callback during the walk would avoid issues
like this. The callback can just pass in the separate inode pointer.

And then a sequence point failure will return -ECHILD and do the walk
again, while a callback success with all the sequence numbers matching
would return -ECALLBACK or whatever, so that the caller would know
"the stat information was already successfully completed by the
callback".

Anyway, that was my handwavy "this is why I was thinking of a
callback" thing. But it's also an example of just how nasty and subtle
this all is.

But I'm convinced this is all eminently *solvable*. There's nothing
fundamental here. Just a lot of small nasty details.

                    Linus