Re: [PATCH 2/3] xfs: test the ascii case-insensitive hash

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 4 Apr 2023 13:51:36 -0700

[now adding hch because his name is on the original patches from 2008]

On Tue, Apr 04, 2023 at 11:06:27AM -0700, Linus Torvalds wrote:
> On Tue, Apr 4, 2023 at 10:07 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >
> \> Now that we've made kernel and userspace use the same tolower code for
> > computing directory index hashes, add that to the selftest code.
> 
> Please just delete this test. It's really really fundamentally wrong.
> 
> The fact that you even *think* that you use the same tolower() as user
> space does shows just that you don't even understand how user space
> works.

Wrong.  I'm well aware that userspace tolower and kernel tolower are
*NOT* the same thing.  I'm trying to **STOP USING** tolower in XFS.

I'm **replacing** tolower with a new function that substitutes specific
bytes with other bytes, and I'm redefining the ondisk format to say that
names remain arbitrary sequences of bytes that do not include nulls or
slashes but that hash indexing and lookups will apply the above byte
transformation.  The rules for that transformation may or may not
coincide with what anyone thinks is "upper case in ASCII", but that's
irrelevant.  The rules will be the same in all places, and they will not
break any existing filesystems.

Maybe I should've named it xfs_ItB8n_ci_o43jM28() to make it clearer
that I don't care what ascii is, nor does it matter here.  Perhaps I
should have written "Now that we've made kernel and userspace perform
the same mathematical transformation on dirent name byte arrays before
hashing, add that to the selftest code as well."

Christoph and Barry Naujok should have defined specifically the exact
transformation and the permitted ranges of name inputs when they wrote
the feature.  I wasn't around when this feature was invented...

> Really. The only thing this series shows is that you do not understand
> the complexities.

...and I don't think they understood the complexities when the code was
written.

> Lookie here: compile and run this program:
> 
>     #include <stdio.h>
>     #include <ctype.h>
>     #include <locale.h>
> 
>     int main(int argc, char **argv)
>     {
>         printf("tolower(0xc4)=%#x\n", tolower(0xc4));
>         setlocale(LC_ALL, "C");
>         printf("tolower(0xc4)=%#x\n", tolower(0xc4));
>         setlocale(LC_ALL, "sv_SE.iso88591");
>         printf("tolower(0xc4)=%#x\n", tolower(0xc4));
>     }
> 
> and on my machine, I get this:
> 
>     tolower(0xc4)=0xc4
>     tolower(0xc4)=0xc4
>     tolower(0xc4)=0xe4
> 
> and the important thing to note is that "on my machine". The first
> line could be *different* on some other machine (and the last line
> could be too: there's no guarantee that the sv_SE locale even exists).
> 
> So this whole "kernel and userspace use the same tolower code"
> sentence is simply completely and utterly wrong. It's not even "wrong"
> in the sense fo "that's not true". It's "wrong" in the sense "that
> shows that you didn't understand the problem at all".
> 
> Put another way: saying "5+8=10" is wrong. But saying "5+8=tiger" is
> nonsensical.
> 
> Your patches are nonsensical.

I disagree.  I'm saying that 5 💩 8 = tiger because that's what the 💩
operator does.  💩 is not +, even if 4 💩 8 = 12.

You claim to understand the complexities; how would /you/ fix this?
I'll send along the test cases that reproduce the problems.

--D

>                Linus