Re: [PATCH v2] libfs: Attempt exact-match comparison first during casefold lookup

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 18 Jan 2024 11:29:31 -0800

On Thu, 18 Jan 2024 at 07:42, Gabriel Krisman Bertazi <krisman@xxxxxxx> wrote:
>
> But I don't see how this could be a problem.  the str pointer itself
> can't change since it's already a copy of the dentry->d_name pointer; if
> the string is external, it is guaranteed to exist until the end of the
> lookup; Finally, If it's inlined, the string can change concurrently
> which, in the worst case scenario, gives us a false result. But then,
> VFS will retry, like we do for the case-insensitive comparison right
> below.
>
> ..Right? :)

Wrong, for two subtle reasons.

The issue is not that the string can go away. The issue is that the
string and the length have been loaded independently - and may not
match.

So when you do that

        if (len == name->len && !memcmp(str, name->name, len))

the 'name->len' you test for equality with 'len' may not match the
length of the string allocated in 'name->name', because they are two
different loads of two different values, and we do not hold the lock
that makes them consistent.

See the big comment (and the READ_ONCE()" in dentry_cmp(), and notice
how dentry_cmp() intentionally doesn't use 'name->len' at all.

Subtle, subtle - but this is *incredibly* performance-critical code.
Locks are completely out of the question - they would make the whole
RCU pathwalk completely pointless, and the RCU path-walk with no
stores to the dentry cache is what makes the dentry cache perform so
well.

So it's not even that locks have contention, it's literally that RCU
path lookup treats the dentries as read-only and you actually get
shared cachelines and true parallel lookups. No reference count
updates, no *nothing* like that.

This is why dentry_cmp() does that magical dentry_string_cmp(), which
is very subtle: it knows that regardless of *which* source of naem we
have, the word-aligned reads of d_name->name are safe, because
 (a) the allocations are word-aligned
 (b) the dentry name allocations all end with a NUL byte (unlike the
pathname ones that can have '/' at the end of the name)
 (c) the *pathname* length is reliable, and the previous word didn't
have a NUL byte in it because that would have ended the compare
 (d) so that "read one word from cs" is safe, *despite* the fact that
the length we have isn't something we can rely on (it comes from the
pathname side, of course).

And yes, this code is subtle. There are very few people who really
understand all the dentry rules. But it is *the* most critical piece
of code in the kernel for a lot of real-life loads. Looking up
pathnames is pretty much "Job #1" of an OS - a lot of the rest is
"details".

                   Linus