On Tue, Apr 4, 2023 at 2:00 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> If it were up to me I'd invent a userspace shim fs that would perform
> whatever normalizations are desired, and pass that (and ideally a lookup
> hash) to the underlying kernel/fs.

No, really. It's been done. It sucks.

Guess what Windows has been doing for *decades* because of their
ass-backwards idiotic filesystem interfaces?

Pretty much exactly that. They had this disgusting wide-char thing for
the native filename interface, historically using something like UCS-2LE
(sometimes called UTF-16), and then people convert that back and forth
in random ways. Mostly in libraries, so that it's mostly invisible to
you.

Almost exactly like that "If it were up to you". It's BS. It's
incredibly bad. It's taken them decades to get away from that mistake
(I _think_ the UCS-2 interfaces are all considered legacy now).

What Windows used to do (again: I _think_ they've gotten over it) was
to have those special native wchar_t interfaces like _wfopen(), and
then when you do a regular "fopen()" call it converts the "normal"
strings to the wchar_t interface for the actual system call.

IOW, doing exactly that "shim fs" thing wrt filenames.

Sprinkle special byte-order markers in there for completeness (because
UCS-2 was a mindblowingly bad idea, along with UCS-4), and add ways to
pick locale on a per-call basis, and you have *really* completed the
unholy mess.

No. Don't go there.

There is absolutely *one* right way, and one right way only: bytes are
bytes. No locale crap, and no case insensitivity. It avoids *all* the
problems, doesn't need any silly conversions, and just *works*.

Absolutely nobody sane actually wants case insensitivity. A byte is a
byte is a byte, and your pathname is a pure byte stream.

Did people learn *nothing* from that fundamental Unix lesson? We had
all kinds of crap filesystems before unix, with 8.3 filenames, or
record-based file contents, or all kinds of stupid ideas.

Do you want your file contents to have a "this is Japanese" marker and
special encoding? No? Then why would you want your pathnames to have
that?

Just making it a byte stream avoids all of that. Yes, we have special
characters (notably NUL and '/'), and we have a couple of special byte
stream entries ("." and ".."), but those are all solidly unambiguous.
We can take US-ASCII as a given.

And once it's a byte stream, you can use it any way you want. Now, the
*sane* way is to then use UTF-8 on top of that byte stream and avoid
all locale issues, but *if* some user space wants to treat those bytes
as Latin1 or as Shift-JIS, it still "works" for them.

This is also the reason why a filesystem *MUST NOT* assume the byte
stream is UTF-8, do some kind of unicode normalization, or reject - or
replace - byte sequences that aren't valid UTF-8. Thinking that names
should have some record-based structure is as wrong as thinking that
file contents should be record-based.

And *no*, you should not have some kind of "standardized translation
layer" in user space - that's just completely unnecessary overhead for
any actual sane user. It's exactly the wrong thing to do.

Others have been there, done that. Learn from their mistakes.

            Linus
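
To make the "bytes are bytes" point concrete, here is a minimal C
sketch. The file name is hypothetical and the behavior assumes an
ordinary Linux filesystem (e.g. ext4 without casefolding enabled): the
kernel takes the pathname as an opaque byte string, so a raw Latin-1
0xE9 byte that is not valid UTF-8 is stored and handed back verbatim.

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* 0xE9 is 'é' in Latin-1; as a lone byte it is not valid UTF-8. */
	const char *name = "caf\xe9.txt";

	int fd = open(name, O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);

	/* The directory entry comes back byte-for-byte identical:
	 * nothing rejected, nothing replaced, nothing normalized. */
	DIR *dir = opendir(".");
	struct dirent *ent;
	while (dir && (ent = readdir(dir)) != NULL) {
		if (strcmp(ent->d_name, name) == 0)
			printf("found it, bytes unchanged\n");
	}
	if (dir)
		closedir(dir);

	unlink(name);
	return 0;
}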
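
The normalization point can be shown the same way. Precomposed NFC
"\xc3\xa9" and decomposed NFD "e\xcc\x81" both render as "é" but are
different byte sequences; because the kernel compares pathname bytes
and nothing else, the two hypothetical names below coexist as separate
files. A filesystem that second-guessed the encoding and normalized
would have to conflate them (or reject one).

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *nfc = "caf\xc3\xa9.txt";  /* U+00E9, precomposed          */
	const char *nfd = "cafe\xcc\x81.txt"; /* 'e' + U+0301 combining acute */

	close(open(nfc, O_CREAT | O_WRONLY, 0644));
	close(open(nfd, O_CREAT | O_WRONLY, 0644));

	/* Both exist: two distinct directory entries for two distinct
	 * byte strings, even though they display identically. */
	printf("nfc present: %s, nfd present: %s\n",
	       access(nfc, F_OK) == 0 ? "yes" : "no",
	       access(nfd, F_OK) == 0 ? "yes" : "no");

	unlink(nfc);
	unlink(nfd);
	return 0;
}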