Re: [PATCH 0/2] Improve documentation on UTF-16

On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded higher than existing
> practice. I do not care a lot what some RFC tells what programs should do if
> the majority of the software does something different and that behavior has
> been proven useful in practice.

The majority of OSes produce the behavior I document here, and they are
the majority of systems on the Internet. Windows is the outlier here,
although a significant one. It is a common user of UTF-16 and its
variants, but so are Java and JavaScript, and they're present on a lot
of devices. Swallowing the U+FEFF would break compatibility with those
systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want. Git for
Windows could patch libiconv to do that (and that is the simple,
five-minute solution to this problem), but we'd still want to warn
people that they're relying on unspecified behavior, hence this series.
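
For anyone who wants to see this locally, here is a quick throwaway
probe (plain C against whatever iconv your system provides; it is not
part of this series) that converts a single 'A' to "UTF-16" and dumps
the raw bytes, so you can see which byte order your converter picks:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	iconv_t cd = iconv_open("UTF-16", "UTF-8");
	char in[] = "A", out[16];
	char *inp = in, *outp = out;
	size_t inleft = strlen(in), outleft = sizeof(out);

	if (cd == (iconv_t)-1 ||
	    iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
		perror("iconv");
		return 1;
	}

	/*
	 * A leading fe ff or ff fe is the BOM and gives away the byte
	 * order; failing that, 'A' comes out as 00 41 (big-endian) or
	 * 41 00 (little-endian).
	 */
	for (char *p = out; p < outp; p++)
		printf("%02x ", (unsigned char)*p);
	printf("\n");

	iconv_close(cd);
	return 0;
}

With GNU libiconv you'll see big-endian output; with glibc you'll see
your machine's native order, which is why the two behave differently
for Windows users.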

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside of the scope of Git proper, and it's not a thing we should be
handling here.

> My understanding is that there is no such thing as a "byte order marker". It
> just so happens that when the first character in some UTF-16 text file
> begins with a ZWNBSP, then it is possible to derive the endianness of the
> file automatically. Other than that, that very first code point U+FEFF *is
> part of the data* and must not be removed when the data is reencoded. If Git
> does something different, it is bogus, IMO.

You've got part of this right. For UTF-16LE and UTF-16BE, a U+FEFF is
part of the text, just as a second one would be if there were two at
the beginning of a UTF-16 or UTF-8 sequence. If someone produces
UTF-16LE with a U+FEFF at the beginning, then when we encode to UTF-8,
we emit only one U+FEFF, which has the wrong semantics: whoever reads
that UTF-8 next will take the lone U+FEFF as a byte order mark rather
than as content.
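
To see that concretely with iconv alone (nothing Git-specific), take
UTF-16LE input whose first code point is U+FEFF-as-content and convert
it to UTF-8:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
	/* UTF-16LE: U+FEFF (content, not a BOM), then 'h', 'i'. */
	char in[] = { '\xff', '\xfe', 'h', 0, 'i', 0 };
	char out[32];
	char *inp = in, *outp = out;
	size_t inleft = sizeof(in), outleft = sizeof(out);
	iconv_t cd = iconv_open("UTF-8", "UTF-16LE");

	if (cd == (iconv_t)-1 ||
	    iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
		perror("iconv");
		return 1;
	}

	/* Expect: ef bb bf 68 69 -- a single U+FEFF up front. */
	for (char *p = out; p < outp; p++)
		printf("%02x ", (unsigned char)*p);
	printf("\n");

	iconv_close(cd);
	return 0;
}

Whatever consumes that UTF-8 will almost certainly treat the single
leading U+FEFF as a BOM and drop it, and the ZWNBSP that was actual
content in the UTF-16LE input is gone.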

To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
at the beginning of a UTF-16LE or UTF-16BE sequence, encode an extra
U+FEFF at the beginning of the UTF-8 data (one for the BOM and one for
the text), and then strip the extra one off again when we decode.
That's kind of ugly, and since iconv doesn't do that itself, we'd have
to.
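
Roughly, the bookkeeping would look something like the sketch below. To
be clear, this is only a sketch: none of these helpers exist in Git,
the names are made up, and it would still need to be wired into the
reencoding path.

#include <stdlib.h>
#include <string.h>

static const unsigned char utf8_bom[] = { 0xef, 0xbb, 0xbf };

/* Does the UTF-16LE/BE input itself begin with U+FEFF as content? */
static int utf16_starts_with_feff(const unsigned char *buf, size_t len,
				  int big_endian)
{
	if (len < 2)
		return 0;
	return big_endian ? (buf[0] == 0xfe && buf[1] == 0xff)
			  : (buf[0] == 0xff && buf[1] == 0xfe);
}

/*
 * After converting to UTF-8, prepend one extra U+FEFF so there are two:
 * one to play the role of the BOM, one that is the original content.
 * Returns a malloc'd buffer the caller must free.
 */
static unsigned char *prepend_utf8_feff(const unsigned char *utf8,
					size_t len, size_t *outlen)
{
	unsigned char *out = malloc(len + sizeof(utf8_bom));

	if (!out)
		return NULL;
	memcpy(out, utf8_bom, sizeof(utf8_bom));
	memcpy(out + sizeof(utf8_bom), utf8, len);
	*outlen = len + sizeof(utf8_bom);
	return out;
}

/*
 * Going back to UTF-16LE/BE, the inverse: drop the first U+FEFF from
 * the UTF-8 (the compensating "BOM"), leaving any second one behind as
 * content to be converted normally.
 */
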
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
