Re: Suggested clarification for .gitattributes reference documentation

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Tue, 16 Jan 2024 02:06:47 +0000

On 2024-01-16 at 00:19:20, Michael Litwak wrote:
> As for documentation clarifications for the .gitattributes manpage at
> https://git-scm.com/docs/gitattributes, I still suggest adding an
> explicit example for UTF-16LE with BOM, and/or adding a table listing
> which working-tree-encoding value to use for each of the following
> UTF-16 text encodings:
> 
> ENCODING              'working-tree-encoding' VALUE
> -------------------   -----------------------------
> UTF-16LE with BOM     UTF-16LE-BOM

I should point out that this encoding, while very common on Windows, is
also nonstandard.  The standard says that UTF-16LE and UTF-16BE don't
include a BOM and are always the respective endianness.  UTF-16 can have
a BOM or not, and if it doesn't, it's big-endian.

There is no standard-conforming way to force the use of little-endian
with a BOM.  The problem is that many Windows programs insist on the
BOM, but also refuse to read big-endian data in violation of the
standard[0].  That's why this nonstandard variant exists in Git.

I'll also note that this particular nonstandard variant is essentially
impossible to encode reliably on Unix outside of Git because it's
nonstandard, so it's an extremely unportable choice.  In fact, I'm not
aware of _any_ tool on my Debian system other than Git that will
guarantee a UTF-16 little-endian stream with BOM.  My editor (Neovim)
certainly doesn't.  (Apparently Emacs, which is not on my system, may
permit that, which does not surprise me in the least.)
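
For illustration only, here is a minimal Python sketch of what a writer
has to do to get this variant: encode as UTF-16LE, which never emits a
BOM, and prepend the little-endian BOM by hand.  The helper name
encode_utf16le_bom is made up for this example and is not part of Git or
of any library.

```python
import codecs


def encode_utf16le_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16 with a leading BOM (FF FE).

    Hypothetical helper for illustration; not how Git implements it.
    """
    # The "utf-16-le" codec never writes a BOM, so prepend it explicitly.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


data = encode_utf16le_bom("hi")
print(data.hex(" "))  # ff fe 68 00 69 00
```

The output corresponds to what working-tree-encoding=UTF-16LE-BOM is
meant to leave in the working tree.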

> UTF-16BE with BOM     UTF-16

It's a little more complicated than that.  "UTF-16" would allow UTF-16
big-endian with BOM, UTF-16 little-endian with BOM, or UTF-16 big-endian
without BOM.  In other words, UTF-16 is big-endian by default and
otherwise requires a BOM, which may be included even if not required.

A reader must handle every variant of this, and must honour the BOM if
set and default to big-endian if not.  A writer may write whichever
variant pleases it most as long as it's consistent within the same
message.
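
As a rough sketch of the reader rule just described (honour the BOM if
present, otherwise assume big-endian), something like the following
Python would do.  The function name decode_utf16 is hypothetical, and I
spell the rule out by hand because, as far as I know, Python's built-in
"utf-16" codec assumes native byte order when there is no BOM.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16": honour a BOM, else assume big-endian.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian, skip the two BOM bytes.
        return data[2:].decode("utf-16-be")
    # No BOM: default to big-endian.
    return data.decode("utf-16-be")


variants = [
    b"\xfe\xff\x00h\x00i",  # big-endian with BOM
    b"\xff\xfeh\x00i\x00",  # little-endian with BOM
    b"\x00h\x00i",          # big-endian without BOM
]
assert all(decode_utf16(v) == "hi" for v in variants)
```

All three byte sequences are valid "UTF-16" for the same text, which is
why a single table row for "UTF-16" is hard to write.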

> UTF-16LE no BOM       UTF-16LE
> UTF-16BE no BOM       UTF-16BE

I think the addition of this table is too much.  UTF-16LE-BOM is common
on Windows, and the rest are substantially less common.  It's also very
difficult to explain in a table what "UTF-16" means in an understandable
way.  I also think it's pretty clear that users should be using UTF-8
without BOM where possible.

We do already mention UTF-16, UTF-16LE, and UTF-16LE-BOM as options
in the gitattributes manual page, and it's up to the user to know what
their program wants and supports if that's not UTF-8.  (I would say that
the user wants a new program that _does_ support UTF-8, but perhaps I'm
being unrealistically harsh.)  I agree it's difficult because the
documentation usually doesn't indicate what's supported and all the
variants are hard to understand, but that's a huge part of the reason
that we recommend UTF-8.

I'll also add that in general, when you do have Unix systems that read
or write data in UTF-16, they handle every variant correctly.  Thus, the
practical choice if you steadfastly refuse to use UTF-8 is either
UTF-16LE-BOM (if your Windows program has the bug I mentioned above) or
UTF-16, both of which we mention already in the manual page.

I'm explicitly ignoring non-file contexts here, where one may use
UTF-16LE or UTF-16BE, but those are substantially less common in actual
files, which is what this feature describes.

> Why bother clarifying the documentation?  Because these UTF-16
> encodings are commonly found on Windows systems.  Notepad supports the
> first two, and many Visual Studio project wizards add various files
> using these encodings as well.  Older versions of PowerShell saved new
> .ps1 scripts using UTF-16BE with BOM as the default encoding.

True, but Notepad also supports UTF-8 and has for quite a while.
According to the PowerShell documentation[1], there is no portable
character set option for non-ASCII characters, so in general it's
impossible to know.  I suspect that a simple "UTF-16" will be fine here,
though, since it clearly doesn't have the bug mentioned above.

> Also, the current .gitattributes documentation makes frequent
> reference to "UTF-16" as an encoding but fails to be clear that the
> working-tree-encoding value "UTF-16" is now only for UTF-16BE with
> BOM.  It would be easy to assume that the working-tree-encoding value
> "UTF-16" meant any UTF-16 file with a BOM (either LE or BE), which was
> the original meaning of this value before UTF-16LE-BOM was added to
> Git.

As I said, your statement isn't correct.  Big-endian with BOM is just
what libiconv does on Windows.  On Linux, glibc uses a little-endian
variant with BOM on little-endian machines.  musl, if memory serves me,
always uses big-endian without a BOM.  All of those are valid encodings,
and a UTF-16 reader must handle all of them.

> Finally, I am not sure how to use git add --renormalize to correct a
> UTF-16 file that was previously added incorrectly (i.e. with a missing
> or incorrect working-tree-encoding entry in .gitattributes).  The git
> add documentation at https://git-scm.com/docs/git-add implies
> 'renormalize' resets only the end-of-line values; however, I suspect
> it also re-converts text encoding when a working-tree-encoding
> property is set.  It would be helpful to know one way or the other.

It does indeed affect the working-tree-encoding.  If you wanted to send
an inline patch created with git format-patch, it would probably be
welcome to mention that.  However, because in this project we typically
scratch our own itch, if you don't send one, it's likely nobody else
will, either.

[0] https://datatracker.ietf.org/doc/html/rfc2781 § 4.1: “All
    applications that process text with the "UTF-16" charset label
    MUST be able to interpret both big-endian and little-endian text.”
[1] https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.4
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA