Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Wed, 17 Jan 2024 at 18:06, Theodore Ts'o <tytso@xxxxxxx> wrote:
>> So we don't need to worry about the user not being able to fix it,
>> because they won't have been able to create the file in the first
>> place.
>
> Yeah, that's a fine argument, until you have a bug or subtle bit flip
> data corruption, and now instead of having something you can recover,
> the system actively says "Nope".

I know this is not your point, but I should add that, in case of a bug
or bit flip, we support "fixing" the "bad utf8" string through fsck.

>> I admit that when I discovered that MacOS errored out on illegal utf-8
>> characters it was mildly annoying,
>
> We may have to be able to interoperate with shit, but let's call it what it is.
>
> Nobody pretends FAT is a great filesystem that made great design
> decisions. That doesn't mean that we can't interoperate with it just
> fine.
>
> But we don't need to take those idiotic and bad design decisions to
> heart, and we don't need to hide the fact that they are horrendous
> design mistakes.

There is a correctness issue with accepting the creation of invalid
utf-8 names that justifies the existence of strict mode. Code points
that are currently undefined can become a casefold match to some other
file in a later Unicode version. When you decide to update your Unicode
version, or even copy the file to a volume with a different version, the
lookup might yield a different file, making one of them inaccessible or
overwriting the wrong file. Obviously, not all corruptions would yield a
"valid" undefined code point. But those are possible.

We currently don't care much, since mkfs will create the volume with a
fixed, never-changed Unicode version. That is, unless the user goes out
of their way to shoot themselves in the foot. Strict mode is an easy way
to prevent this class of issues (aside from corruptions).

> So "strict" mode should mean that you can't *create* a misformed UTF-8
> filename.
>
> It's that same "be conservative in what you do".
>
> But *dammit*, if "strict" mode means that you can't even read other
> peoples mistakes because your "->lookup()" function refuses to even
> look at it, then "strict" mode is GARBAGE.
>
> That's the "be liberal in what you accept" part. Do it, or be damned.

Yes, we could be more liberal in the lookup while restricting the
creation of invalid utf8 sequences. But the only case where it would
matter is for corrupted volumes, where a filename suddenly changed to
something invalid. Considering ext4 and f2fs, since the disk direntry
hash (which is hash(casefolded(filename))) didn't get corrupted exactly
right, looking up the exact match of the invalid name might fail. This
would create even more inconsistent semantics, where small, non-hashed
directories can find these files, but larger, hashed directories might
not. And that is even more confusing to users, since it exposes
internal filesystem details.

I get the point about how annoying the current semantics is. But I
still think this is the sanest approach to a fundamentally insane
feature.

-- 
Gabriel Krisman Bertazi
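
The hashed-directory concern above can be sketched with a toy model.
This is only an illustration under stated assumptions, not ext4 or f2fs
code: the names `Dirent`, `casefold_name`, `name_hash`, `lookup_linear`
and `lookup_hashed` are hypothetical, CRC32 stands in for the real
directory hash, Python's `str.casefold` stands in for the kernel's
casefolding, and treating an invalid sequence as opaque bytes is an
assumption about what a more liberal lookup would do.

```python
import zlib


def casefold_name(name: bytes) -> bytes:
    """Toy casefold: fold valid UTF-8, pass invalid bytes through unchanged."""
    try:
        return name.decode("utf-8").casefold().encode("utf-8")
    except UnicodeDecodeError:
        # Assumption: a liberal lookup treats invalid names as opaque bytes.
        return name


def name_hash(name: bytes) -> int:
    """Stand-in for hash(casefolded(filename)); CRC32 is not the real hash."""
    return zlib.crc32(casefold_name(name))


class Dirent:
    def __init__(self, name: bytes):
        self.name = name
        # The hash is computed once, at creation, and stored with the entry.
        self.hash = name_hash(name)


def lookup_linear(entries, name):
    """Small directory: scan every entry and compare casefolded names."""
    target = casefold_name(name)
    for entry in entries:
        if casefold_name(entry.name) == target:
            return entry
    return None


def lookup_hashed(entries, name):
    """Large directory: only entries whose stored hash matches are compared."""
    target = casefold_name(name)
    wanted = zlib.crc32(target)
    for entry in entries:
        if entry.hash == wanted and casefold_name(entry.name) == target:
            return entry
    return None


entry = Dirent("café.txt".encode("utf-8"))
good_name = entry.name

# Simulate a single bit flip in the on-disk name: the continuation byte
# 0xA9 becomes 0x29, so the name is no longer valid UTF-8. The stored
# hash is left untouched, as it would be after this kind of corruption.
damaged = bytearray(entry.name)
damaged[4] ^= 0x80
entry.name = bytes(damaged)
directory = [entry]

found_linear = lookup_linear(directory, entry.name)
found_hashed = lookup_hashed(directory, entry.name)
print(found_linear is entry)  # True
print(found_hashed is None)   # True
```

The same corrupted name is found by the linear scan and missed by the
hashed lookup, because the stored hash still describes the name as it
was before the bit flip.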