Re: [PATCH 0/2] Improve documentation on UTF-16

On 27.12.18 at 03:17, brian m. carlson wrote:
We've recently fielded several reports from unhappy Windows users about
our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
suitable for certain Windows programs.

In an effort to communicate the reasons for our behavior more
effectively, explain in the documentation that the UTF-16 variant that
people have been asking for hasn't been standardized, and therefore
hasn't been implemented in iconv(3). Mention what each of the variants
does, so that people can decide which one best meets their needs.

In addition, add a comment in the code explaining why we must, for
correctness reasons, reject a UTF-16LE or UTF-16BE sequence that
begins with U+FEFF: in those encodings such a code point semantically
represents a ZWNBSP, not a BOM, but at the beginning of the UTF-8
sequence stored in the object store it would be misinterpreted as a
BOM instead.
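For the curious, a simplified sketch of such a check (illustrative
names, not the actual code in Git):

	#include <string.h>

	/*
	 * When the worktree encoding is declared as UTF-16LE or
	 * UTF-16BE, a leading U+FEFF must be rejected: in those
	 * encodings it can only mean ZWNBSP, which is data, but after
	 * re-encoding to UTF-8 for the object store a leading U+FEFF
	 * would read back as a BOM.
	 */
	static int has_prohibited_bom(const char *enc,
				      const unsigned char *buf, size_t len)
	{
		if (len < 2)
			return 0;
		if (!strcmp(enc, "UTF-16LE"))
			return buf[0] == 0xff && buf[1] == 0xfe; /* U+FEFF, LE */
		if (!strcmp(enc, "UTF-16BE"))
			return buf[0] == 0xfe && buf[1] == 0xff; /* U+FEFF, BE */
		return 0;
	}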

This comment is in the code because I think it needs to be somewhere,
but I'm not sure the documentation is the right place for it. If
desired, I can add it to the documentation, although I feel the lurid
details are not interesting to most users. If the wording is confusing,
I'm very open to hearing suggestions for how to improve it.

I don't use Windows, so I don't know what MSVCRT does. If it requires a
BOM but doesn't accept big-endian encoding, then perhaps we should
report that as a bug to Microsoft so it can be fixed in a future
version. That would probably make a lot more programs work right out of
the box and dramatically improve the user experience.

It worries me that theoretical correctness is valued more highly than existing practice. I do not care much what some RFC says programs should do if the majority of software does something different and that behavior has proven useful in practice.

My understanding is that there is no such thing as a "byte order marker". It just so happens that when some UTF-16 text file begins with a ZWNBSP, it is possible to derive the endianness of the file automatically. Other than that, that very first code point U+FEFF *is part of the data* and must not be removed when the data is re-encoded. If Git does something different, it is bogus, IMO.
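To spell that derivation out as a sketch (illustrative only; note that
the inspected code point is not consumed, it stays in the data):

	#include <stddef.h>

	enum byte_order { ORDER_UNKNOWN, ORDER_LE, ORDER_BE };

	/*
	 * If (and only if) the text happens to start with U+FEFF, its
	 * serialization reveals the byte order.  The guess is safe
	 * because ff fe read big-endian would be U+FFFE, a
	 * noncharacter that never appears in valid text.
	 */
	static enum byte_order derive_order(const unsigned char *buf,
					    size_t len)
	{
		if (len >= 2 && buf[0] == 0xff && buf[1] == 0xfe)
			return ORDER_LE; /* U+FEFF stored low byte first */
		if (len >= 2 && buf[0] == 0xfe && buf[1] == 0xff)
			return ORDER_BE; /* U+FEFF stored high byte first */
		return ORDER_UNKNOWN; /* no leading U+FEFF: nothing to derive */
	}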

-- Hannes


