On Wed, Jan 17, 2024 at 04:40:17PM -0800, Linus Torvalds wrote:
> Note that the whole "malformed utf-8 is an error" is actually wrong anyway.
>
> Yes, if you *output* utf-8, and your output is malformed, then that's
> an error that needs fixing.
>
> But honestly, "malformed utf-8" on input is almost always just "oh, it
> wasn't utf-8 to begin with, and somebody is still using Latin-1 or
> Shift-JIS or whatever".
>
> And then treating that as some kind of hard error is actually really
> really wrong and annoying, and may end up meaning that the user cannot
> *fix* it, because they can't access the data at all.

A file system which supports casefolding can support "strict" mode
(not the default) where attempts to create files that have invalid
UTF-8 characters are rejected with an error before a file or hard link
is created (or renamed).

This is what MacOS does, by the way.  If you unpack a Windows Zip file
(created by downloading a directory hierarchy from a Microsoft
Sharepoint) on a Linux box, and then try to scp or rsync the result
over to MacOS, MacOS will refuse to allow a file to be created if its
name contains invalid UTF-8 characters, and rsync or scp will report
an error.  I just ran into this earlier today...

So we don't need to worry about the user not being able to fix it,
because they won't have been able to create the file in the first
place.

This is not the default, since we know there are a bunch of users who
might be creating files using the unofficial "Klingon" characters (for
example) that are not officially part of Unicode since Unicode will
only allow characters used by human languages, and Klingon doesn't
qualify.  I believe though that Android has elected to enable
casefolding in strict mode, which is fine as far as I'm concerned.

> I find libraries that just error out on "malformed utf-8" to be
> actively harmful.

I admit that when I discovered that MacOS errored out on illegal utf-8
characters it was mildly annoying, but it wasn't that hard to fix it
on the Linux side and then I retried the rsync.  It also turned out
that if I unpacked the zip file on MacOS, the filename was created
without the illegal utf-8 characters, so there may have been something
funky going on with the zip userspace program on Linux.  I haven't
cared enough to try to debug it...

						- Ted
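
For illustration only, here is a minimal userspace sketch of the
strict-mode idea described above: reject a name at creation time if it
is not well-formed UTF-8, and otherwise let it through.  This is not
the kernel's implementation.  The helper names (is_well_formed_utf8,
check_name) are hypothetical, and the check is purely byte-level
well-formedness, which may be narrower than what a real file system
enforces.

```python
import errno


def is_well_formed_utf8(name: bytes) -> bool:
    """Return True if `name` decodes as UTF-8 (byte-level check only)."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_name(name: bytes, strict: bool) -> bytes:
    """Hypothetical helper: accept or reject a file name before creation.

    In strict mode a malformed name is refused with EINVAL, so the file
    never comes into existence.  In non-strict mode the bytes pass
    through untouched.
    """
    if strict and not is_well_formed_utf8(name):
        raise OSError(errno.EINVAL, "file name is not valid UTF-8")
    return name


# The same visible name in two encodings: the Latin-1 bytes are not
# valid UTF-8, which is the "it wasn't utf-8 to begin with" case.
latin1_name = "café.txt".encode("latin-1")
utf8_name = "café.txt".encode("utf-8")
```

With strict=False both names are accepted as opaque bytes; with
strict=True only utf8_name gets through, and the caller sees an error
up front instead of ending up with a file it cannot name later.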