On 27.12.18 at 03:17, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be suitable for certain Windows programs. In an effort to communicate the reasons for our behavior more effectively, explain in the documentation that the UTF-16 variant that people have been asking for hasn't been standardized, and therefore hasn't been implemented in iconv(3). Mention what each of the variants does, so that people can decide which one best meets their needs.
>
> In addition, add a comment in the code about why we must, for correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins with U+FEFF: such a code point semantically represents a ZWNBSP, not a BOM, but that code point at the beginning of a UTF-8 sequence (as encoded in the object store) would be misinterpreted as a BOM instead.
>
> This comment is in the code because I think it needs to be somewhere, but I'm not sure the documentation is the right place for it. If desired, I can add it to the documentation, although I feel the gory details are not interesting to most users. If the wording is confusing, I'm very open to suggestions for how to improve it.
>
> I don't use Windows, so I don't know what MSVCRT does. If it requires a BOM but doesn't accept a big-endian encoding, then perhaps we should report that as a bug to Microsoft so it can be fixed in a future version. That would probably make a lot more programs work right out of the box and dramatically improve the user experience.
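To make the difference concrete, here is a minimal sketch (not Git
code) of what the three names mean to iconv(3). The byte sequences in
the comments are what glibc produces on a little-endian machine; other
iconv implementations may choose a different byte order for plain
"UTF-16":

#include <iconv.h>
#include <stdio.h>
#include <string.h>

static void convert(const char *to)
{
	char in[] = "A";		/* one ASCII character */
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = strlen(in), outleft = sizeof(out);
	iconv_t cd = iconv_open(to, "UTF-8");

	if (cd == (iconv_t)-1)
		return;
	iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);

	printf("%-8s:", to);
	for (char *p = out; p < outp; p++)
		printf(" %02x", (unsigned char)*p);
	printf("\n");
}

int main(void)
{
	convert("UTF-16");	/* BOM, byte order up to the converter:
				   ff fe 41 00 here */
	convert("UTF-16LE");	/* no BOM, little-endian: 41 00 */
	convert("UTF-16BE");	/* no BOM, big-endian: 00 41 */
	return 0;
}

None of these spellings means "little-endian with a BOM", which
appears to be the variant the reports ask for.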
It worries me that theoretical correctness is valued more highly than existing practice. I do not care much what some RFC says programs should do if the majority of software does something different and that behavior has proven useful in practice.
My understanding is that there is no such thing as a "byte order marker". It just so happens that when a UTF-16 text file begins with a ZWNBSP, the endianness of the file can be derived automatically. Other than that, that very first code point U+FEFF *is part of the data* and must not be removed when the data is re-encoded. If Git does something different, it is bogus, IMO.
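As a sketch of that re-encoding concern (again assuming glibc's
iconv(3)): a UTF-16LE stream whose first code point is U+FEFF,
converted faithfully, yields ef bb bf at the front of the UTF-8
result, where a consumer can no longer tell data from BOM:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
	/* UTF-16LE: U+FEFF (a ZWNBSP, part of the data), then U+0041 */
	char in[] = { '\xff', '\xfe', 'A', '\0' };
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = sizeof(in), outleft = sizeof(out);
	iconv_t cd = iconv_open("UTF-8", "UTF-16LE");

	if (cd == (iconv_t)-1)
		return 1;
	iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);

	/* prints "ef bb bf 41": U+FEFF is kept, and its UTF-8 form is
	   byte-identical to what tools strip as a "UTF-8 BOM" */
	for (char *p = out; p < outp; p++)
		printf("%02x ", (unsigned char)*p);
	printf("\n");
	return 0;
}

The conversion itself preserves the code point; whether the ef bb bf
is then read back as data or as a BOM is entirely up to the consumer.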
-- Hannes