Re: [PATCH 0/2] Improve documentation on UTF-16

On 27.12.18 at 17:43, brian m. carlson wrote:
> On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
>> It worries me that theoretical correctness is valued more highly than
>> existing practice. I do not care much what some RFC says programs should
>> do if the majority of the software does something different and that
>> behavior has been proven useful in practice.

> The majority of OSes produce the behavior I document here, and they are
> the majority of systems on the Internet. Windows is the outlier here,
> although a significant one. It is a common user of UTF-16 and its
> variants, but so are Java and JavaScript, and they're present on a lot
> of devices. Swallowing the U+FEFF would break compatibility with those
> systems.

> The issue that Windows users are seeing is that libiconv always produces
> big-endian data for UTF-16, and they always want little-endian. glibc
> produces native-endian data, which is what Windows users want. Git for
> Windows could patch libiconv to do that (and that is the simple,
> five-minute solution to this problem), but we'd still want to warn
> people that they're relying on unspecified behavior, hence this series.
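
(Not part of this series, but for anyone who wants to check what their
local iconv does: a minimal, untested sketch that converts a short string
from UTF-8 to the byte-order-less "UTF-16" and dumps the bytes, so it is
easy to see whether the implementation picked big- or little-endian, as
described above. Link with -liconv where libiconv is a separate library.)

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "hi";
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out);
    iconv_t cd = iconv_open("UTF-16", "UTF-8");

    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    /* Dump the output bytes; "fe ff"/"00 68" at the front means the
     * implementation chose big-endian, "ff fe"/"68 00" little-endian. */
    for (char *p = out; p < outp; p++)
        printf("%02x ", (unsigned char)*p);
    printf("\n");
    return 0;
}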

> I would even be willing to patch Git for Windows's libiconv if somebody
> could point me to the repo (although I obviously cannot test it, not
> being a Windows user). I feel strongly, though, that fixing this is
> outside of the scope of Git proper, and it's not a thing we should be
> handling here.

Please forgive me for leaving most of what you said uncommented, as I am
not deep in the matter and don't have a firm understanding of all the
issues. I'll just trust that what you said is sound.

Just one thing: please do the count by *users* (or existing files, or the
number of characters exchanged, or something similar); do not just count
OSes. I mean, Windows is *not* the outlier if it handles 90% of the UTF-16
data in the world. (I'm just making up numbers here, but I think you get
the point.)

>> My understanding is that there is no such thing as a "byte order
>> marker". It just so happens that when the first character of some
>> UTF-16 text file is a ZWNBSP, the endianness of the file can be derived
>> automatically. Other than that, that very first code point U+FEFF *is
>> part of the data* and must not be removed when the data is reencoded.
>> If Git does something different, it is bogus, IMO.

> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> the text, as would a second one be if we had two at the beginning of a
> UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
> U+FEFF, which has the wrong semantics.

> To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
> an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
> for the text) and then strip it off when we decode. That's kind of ugly,
> and since iconv doesn't do that itself, we'd have to.
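
(If I read that correctly, in bytes the proposal would look something like
this for a UTF-16LE file whose content starts with a literal U+FEFF
followed by 'h':

    working tree (UTF-16LE):  ff fe 68 00
    repository   (UTF-8):     ef bb bf ef bb bf 68    [extra U+FEFF added]
    checkout     (UTF-16LE):  ff fe 68 00             [one U+FEFF stripped]

I may well be misreading it, though.)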

But why do you add another U+FEFF on the way to UTF-8? There is one in the
incoming UTF-16 data, and only *that* one must be converted. If there is
no U+FEFF in the UTF-16 data, there should not be one in UTF-8, either.
Puzzled...
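
(In other words, what I would expect is a strict one-to-one conversion,
e.g. for a UTF-16LE file that starts with a literal U+FEFF followed by
'h':

    working tree (UTF-16LE):  ff fe 68 00
    repository   (UTF-8):     ef bb bf 68
    checkout     (UTF-16LE):  ff fe 68 00

with no U+FEFF added or removed in either direction.)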

-- Hannes


