Re: [PATCHSET 0/3] xfs: fix ascii-ci problems with userspace

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 4 Apr 2023 14:00:09 -0700

On Tue, Apr 04, 2023 at 01:21:25PM -0700, Linus Torvalds wrote:
> On Tue, Apr 4, 2023 at 11:19 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Limiting yourself to US-ASCII is at least technically valid. Because
> > EBCDIC isn't worth worrying about.  But when the high bit is set, you
> > had better not touch it, or you need to limit it spectacularly.
> 
> Side note: if limiting it to US-ASCII is fine (and it had better be,
> because as mentioned, anything else will result in unresolvable
> problems), you might look at using this as the pre-hash function:
> 
>     unsigned char prehash(unsigned char c)
>     {
>         unsigned char mask = (~(c >> 1) & c & 64) >> 1;
>         return c & ~mask;
>     }
> 
> which does modify a few other characters too, but nothing that matters
> for hashing.
> 
> The advantage of the above is that you can trivially vectorize it. You
> can do it with just regular integer math (64 bits = 8 bytes in
> parallel), no need to use *actual* vector hardware.
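
A minimal sketch of that 8-bytes-at-a-time idea, assuming the names are packed into 64-bit words. prehash8() and prehash8_matches_scalar() are made-up names for illustration, not anything in the tree. The shift by one only ever moves bit 7 of a byte into bit 6 of the same byte before the 0x40 mask is applied, so neighbouring bytes do not leak into each other.

```c
#include <stdint.h>

/* Scalar reference, as given in the quoted mail. */
static unsigned char prehash(unsigned char c)
{
    unsigned char mask = (~(c >> 1) & c & 64) >> 1;
    return c & ~mask;
}

/* Hypothetical: the same transform applied to 8 packed bytes at once. */
static uint64_t prehash8(uint64_t w)
{
    /* 0x20 in each byte that has bit 6 set and bit 7 clear (0x40..0x7f) */
    uint64_t mask = (~(w >> 1) & w & 0x4040404040404040ull) >> 1;
    return w & ~mask;
}

/*
 * Hypothetical self-check: every byte value in every lane, with the other
 * lanes filled with 0xff (which the transform leaves alone), must agree
 * with the scalar version.  Returns 1 on success, 0 on mismatch.
 */
static int prehash8_matches_scalar(void)
{
    for (unsigned c = 0; c < 256; c++) {
        for (int lane = 0; lane < 8; lane++) {
            uint64_t hole = ~(0xffull << (8 * lane));
            uint64_t w = (~0ull & hole) | ((uint64_t)c << (8 * lane));
            uint64_t want = (~0ull & hole) |
                    ((uint64_t)prehash((unsigned char)c) << (8 * lane));
            if (prehash8(w) != want)
                return 0;
        }
    }
    return 1;
}
```

Bytes with the high bit set pass through untouched, which is the "do not touch it" rule from the earlier mail.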
> 
> The actual comparison needs to do the careful thing (because '~' and
> '^' may hash to the same value, but obviously aren't the same), but
> even there you can do a cheap "are these 8 characters _possibly_ the
> same" check with a very simple single 64-bit comparison, and only go to the
> careful path if things match, ie
> 
>     /* Cannot possibly be equal even case-insensitively? */
>     if ((word1 ^ word2) & ~0x2020202020202020ul)
>         continue;
>     /* Ok, same in all but the 5th bits, go be careful */
>     ....
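
One way the elided careful path could look, as a rough sketch only. names_equal_ci() and fold_ascii() are hypothetical helpers, and the careful step here is assumed to be a plain a-z fold. The word filter is the one from the quoted snippet; it returns early where the snippet has "continue", since this sketch compares a single pair of names rather than looping over candidates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical careful fold: only a-z is mapped, everything else is kept. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 0x20) : c;
}

/* Hypothetical: returns 1 if the two len-byte names match ASCII-ci. */
static int names_equal_ci(const char *a, const char *b, size_t len)
{
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t word1, word2;

        /* memcpy avoids unaligned loads; byte order does not matter here */
        memcpy(&word1, a + i, 8);
        memcpy(&word2, b + i, 8);

        /* Differ in some bit other than 0x20: cannot possibly be equal */
        if ((word1 ^ word2) & ~0x2020202020202020ull)
            return 0;

        /* Same in all but the 0x20 bits: compare byte by byte */
        for (size_t j = i; j < i + 8; j++)
            if (fold_ascii((unsigned char)a[j]) !=
                fold_ascii((unsigned char)b[j]))
                return 0;
    }

    /* Tail shorter than one word: careful path only */
    for (; i < len; i++)
        if (fold_ascii((unsigned char)a[i]) !=
            fold_ascii((unsigned char)b[i]))
            return 0;

    return 1;
}
```

The cheap test lets "~~~~~~~~" and "^^^^^^^^" through because they differ only in the 0x20 bits, and the byte loop is what rejects them.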
> 
> and the reason I mention this is because I have been idly thinking
> about supporting case-insensitivity at the VFS layer for multiple
> decades, but have always decided that it's *so* nasty that I really
> was hoping it just is never an issue in practice.

If it were up to me I'd invent a userspace shim fs that would perform
whatever normalizations are desired, and pass that (and ideally a lookup
hash) to the underlying kernel/fs.  Users can configure whatever
filtering they want and set LC_ALL as they please, and we kernel
developers never have to know, and the users never have to see what
actually gets written to disk.  If users want normalized ci lookups, the
shim can do that.

ext4 tried to do better than XFS by actually defining the mathematical
transformation that would be applied to incoming names and refusing
things that would devolve into brokenness, but then it turns out that it
was utf8_data.c.  Urgh.

I get it, shi* fses are not popular and are not fast, but if the Samba
benchmarks are still valid, multiple kernel<->fuserspace transitions are
still faster than their workaround.

> Particularly since the low-level filesystems then inevitably decide
> that they need to do things wrong and need a locale, and at that point
> all hope is lost.
> 
> I was hoping xfs would be one of the sane filesystems.

Hah, nope, I'm all out of sanity here. :(

--D

>                Linus