On Tue, Apr 4, 2023 at 2:00 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> If it were up to me I'd invent a userspace shim fs that would perform
> whatever normalizations are desired, and pass that (and ideally a lookup
> hash) to the underlying kernel/fs.

No, really. It's been done. It sucks.

Guess what Windows has been doing for *decades* because of their
ass-backwards idiotic filesystem interfaces?

Pretty much exactly that. They had this disgusting wide-char thing for
the native filename interface, historically using something like UCS-2LE
(sometimes called UTF-16), and then people convert that back and forth
in random ways. Mostly in libraries, so that it's mostly invisible to
you.

Almost exactly like that "If it were up to you". It's BS. It's
incredibly bad. It's taken them decades to get away from that mistake
(I _think_ the UCS-2 interfaces are all considered legacy now).

What Windows used to do (again: I _think_ they've gotten over it) was
to have those special native wchar_t interfaces like _wfopen(), and
then when you do a regular "fopen()" call it converts the "normal"
strings to the wchar_t interface for the actual system call.

IOW, doing exactly that "shim fs" thing wrt filenames.

Sprinkle special byte-order markers in there for completeness (because
UCS-2 was a mindblowingly bad idea, along with UCS-4), and add ways to
pick locale on a per-call basis, and you have *really* completed the
unholy mess.

No. Don't go there.

There is absolutely *one* right way, and one right way only: bytes are
bytes. No locale crap, and no case insensitivity. It avoids *all* the
problems, doesn't need any silly conversions, and just *works*.

Absolutely nobody sane actually wants case insensitivity. A byte is a
byte is a byte, and your pathname is a pure byte stream.

Did people learn *nothing* from that fundamental Unix lesson? We had
all kinds of crap filesystems before unix, with 8.3 filenames, or
record-based file contents, or all kinds of stupid ideas.

Do you want your file contents to have a "this is Japanese" marker and
special encoding? No? Then why would you want your pathnames to have
that?

Just making it a byte stream avoids all of that. Yes, we have special
characters (notably NUL and '/'), and we have a couple of special byte
stream entries ("." and ".."), but those are all solidly unambiguous.
We can take US-ASCII as a given.

And once it's a byte stream, you can use it any way you want. Now, the
*sane* way is to then use UTF-8 on top of that byte stream and avoid
all locale issues, but *if* some user space wants to treat those bytes
as Latin1 or as Shift-JIS, it still "works" for them.

This is also the reason why a filesystem *MUST NOT* assume the byte
stream is UTF-8, do some kind of unicode normalization, or reject - or
replace - byte sequences that aren't valid UTF-8. Thinking that names
should have some record-based structure is as wrong as thinking that
file contents should be record-based.

And *no*, you should not have some kind of "standardized translation
layer" in user space - that's just completely unnecessary overhead for
any actual sane user. It's exactly the wrong thing to do.

Others have been there, done that. Learn from their mistakes.

            Linus
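
To make the "bytes are bytes" point concrete, here is a minimal C
sketch. The file name is hypothetical and the behavior assumes an
ordinary Linux filesystem (e.g. ext4 without casefolding enabled): the
kernel takes the pathname as an opaque byte string, so a raw Latin-1
0xE9 byte that is not valid UTF-8 is stored and handed back verbatim.

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* 0xE9 is 'é' in Latin-1; as a lone byte it is not valid UTF-8. */
	const char *name = "caf\xe9.txt";

	int fd = open(name, O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);

	/* The directory entry comes back byte-for-byte identical:
	 * nothing rejected, nothing replaced, nothing normalized. */
	DIR *dir = opendir(".");
	struct dirent *ent;
	while (dir && (ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, name) == 0)
			printf("found it, bytes unchanged\n");
	}
	if (dir)
		closedir(dir);

	unlink(name);
	return 0;
}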
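
The normalization point can be shown the same way. Precomposed NFC
"\xc3\xa9" and decomposed NFD "e\xcc\x81" both render as "é" but are
different byte sequences; because the kernel compares pathname bytes
and nothing else, the two hypothetical names below coexist as separate
files. A filesystem that second-guessed the encoding and normalized
would have to conflate them (or reject one).

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *nfc = "caf\xc3\xa9.txt";  /* U+00E9, precomposed          */
	const char *nfd = "cafe\xcc\x81.txt"; /* 'e' + U+0301 combining acute */

	close(open(nfc, O_CREAT | O_WRONLY, 0644));
	close(open(nfd, O_CREAT | O_WRONLY, 0644));

	/* Both exist: two distinct directory entries for two distinct
	 * byte strings, even though they display identically. */
	printf("nfc present: %s, nfd present: %s\n",
	       access(nfc, F_OK) == 0 ? "yes" : "no",
	       access(nfd, F_OK) == 0 ? "yes" : "no");

	unlink(nfc);
	unlink(nfd);
	return 0;
}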