On 2024-01-16 at 00:19:20, Michael Litwak wrote:
> As for documentation clarifications for the .gitattributes manpage at
> https://git-scm.com/docs/gitattributes, I still suggest adding an
> explicit example for UTF-16LE with BOM, and/or adding a table listing
> which working-tree-encoding value to use for each of the following
> UTF-16 text encodings:
>
>    ENCODING               'working-tree-encoding' VALUE
>    -------------------    -----------------------------
>    UTF-16LE with BOM      UTF-16LE-BOM

I should point out that this encoding, while very common on Windows, is
also nonstandard.  The standard says that UTF-16LE and UTF-16BE don't
include a BOM and are always the respective endianness.  UTF-16 can have
a BOM or not, and if it doesn't, it's big-endian.  There is no
standard-conforming way to force the use of little-endian with a BOM.

The problem is that many Windows programs insist on the BOM, but also
refuse to read big-endian data in violation of the standard[0].  That's
why this nonstandard variant exists in Git.

I'll also note that this particular nonstandard variant is essentially
impossible to encode reliably on Unix outside of Git because it's
nonstandard, so it's an extremely unportable choice.  In fact, I'm not
aware of _any_ tool on my Debian system other than Git that will
guarantee a UTF-16 little-endian stream with BOM.  My editor (Neovim)
certainly doesn't.  (Apparently Emacs, which is not on my system, may
permit that, which does not surprise me in the least.)

>    UTF-16BE with BOM      UTF-16

It's a little more complicated than that.  "UTF-16" would allow UTF-16
big-endian with BOM, UTF-16 little-endian with BOM, or UTF-16 big-endian
without BOM.  In other words, UTF-16 is big-endian by default and
otherwise requires a BOM, which may be included even if not required.  A
reader must handle every variant of this, and must honour the BOM if set
and default to big-endian if not.  A writer may write whichever variant
pleases it most as long as it's consistent within the same message.
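
To make the reader and writer rules above concrete, here is a minimal
Python sketch.  It is only an illustration under my own assumptions, not
what Git or iconv actually does: the function names `decode_utf16` and
`encode_utf16le_bom` are hypothetical, and the second one produces the
nonstandard little-endian-with-BOM stream that Git calls UTF-16LE-BOM.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16": honour a BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: big-endian with BOM
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: little-endian with BOM
        return data[2:].decode("utf-16-le")
    # No BOM at all: the standard's default is big-endian.
    return data.decode("utf-16-be")


def encode_utf16le_bom(text: str) -> bytes:
    """Write little-endian data with a leading BOM (hypothetical helper)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

All three permitted "UTF-16" variants decode to the same text with this
sketch, and the little-endian-with-BOM output round-trips through it.  I
deliberately avoided Python's own "utf-16" codec for the no-BOM case
because, as far as I know, it falls back to the machine's native byte
order rather than to big-endian.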
>    UTF-16LE no BOM        UTF-16LE
>    UTF-16BE no BOM        UTF-16BE

I think the addition of this table is too much.  UTF-16LE-BOM is common
on Windows, and the rest are substantially less common.  It's also very
difficult to explain in a table what "UTF-16" means in an understandable
way.  And I think it's pretty clear that users should be using UTF-8
without BOM where possible.

We do already mention UTF-16, UTF-16LE, and UTF-16LE-BOM as options in
the gitattributes manual page, and it's up to the user to know what
their program wants and supports if that's not UTF-8.  (I would say that
the user wants a new program that _does_ support UTF-8, but perhaps I'm
being unrealistically harsh.)

I agree it's difficult because the documentation usually doesn't
indicate what's supported and all the variants are hard to understand,
but that's a huge part of the reason that we recommend UTF-8.

I'll also add that in general, when you do have Unix systems that read
or write data in UTF-16, they handle every variant correctly.  Thus, the
practical choice if you steadfastly refuse to use UTF-8 is either
UTF-16LE-BOM (if your Windows program has the bug I mentioned above) or
UTF-16, both of which we mention already in the manual page.

I'm explicitly ignoring non-file contexts here, where one may use
UTF-16LE or UTF-16BE, but those are substantially less common in actual
files, which is what this feature describes.

> Why bother clarifying the documentation?  Because These UTF-16
> encodings are commonly found on Windows systems.  Notepad supports the
> first two, and many Visual Studio project wizards add various files
> using these encodings as well.  Older versions of PowerShell saved new
> .ps1 scripts using UTF-16BE with BOM as the default encoding.

True, but Notepad also supports UTF-8 and has for quite a while.
According to the PowerShell documentation[1], there is no portable
character set option for non-ASCII characters, so in general it's
impossible to know.
I suspect that a simple "UTF-16" will be fine here, though, since it
clearly doesn't have the bug mentioned above.

> Also, the current .gitattributes documentation makes frequent
> reference to "UTF-16" as an encoding but fails to be clear that the
> working-tree-encoding value "UTF-16" is now only for UTF-16BE with
> BOM.  It would be easy to assume that the working-tree-encoding value
> "UTF-16" meant any UTF-16 file with a BOM (either LE or BE), which was
> the original meaning of this value before UTF-16LE-BOM was added to
> Git.

As I said, your statement isn't correct.  That's what libiconv does on
Windows.  On Linux, glibc uses a little-endian variant with BOM on
little-endian machines.  musl, if memory serves me, always uses
big-endian without a BOM.  All of those are valid encodings, and a
UTF-16 reader must handle all of them.

> Finally, I am not sure how to use git add --renormalize to correct a
> UTF-16 file that was previously added incorrectly (i.e. with a missing
> or incorrect working-tree-encoding entry in .gitattributes).  The git
> add documentation at https://git-scm.com/docs/git-add implies
> 'renormalize' resets only the end-of-line values; however, I suspect
> it also re-converts text encoding when a working-tree-encoding
> property is set.  It would be helpful to know one way or the other.

It does indeed affect the working-tree-encoding.  If you wanted to send
an inline patch, created with git format-patch, that mentions this in
the documentation, it would probably be welcome.  However, because in
this project we typically scratch our own itch, if you don't send one,
it's likely nobody else will, either.

[0] https://datatracker.ietf.org/doc/html/rfc2781 § 4.1: “All
    applications that process text with the "UTF-16" charset label MUST
    be able to interpret both big-endian and little-endian text.”
[1] https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.4
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA