On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded higher than
> existing practice. I do not care a lot what some RFC tells what
> programs should do if the majority of the software does something
> different and that behavior has been proven useful in practice.

The majority of OSes produce the behavior I document here, and those
OSes make up the majority of systems on the Internet. Windows is the
outlier here, although a significant one. It is a common user of UTF-16
and its variants, but so are Java and JavaScript, and they're present on
a lot of devices. Swallowing the U+FEFF would break compatibility with
those systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want (the first
sketch at the end of this message shows the difference). Git for Windows
could patch libiconv to do that (and that is the simple, five-minute
solution to this problem), but we'd still want to warn people that
they're relying on unspecified behavior, hence this series.

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside the scope of Git proper, and it's not something we should be
handling here.

> My understanding is that there is no such thing as a "byte order
> marker". It just so happens that when the first character in some
> UTF-16 text file begins with a ZWNBSP, then it is possible to derive
> the endianness of the file automatically. Other then that, that very
> first code point U+FEFF *is part of the data* and must not be removed
> when the data is reencoded. If Git does something different, it is
> bogus, IMO.

You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
the text, as would a second one be if we had two at the beginning of a
UTF-16 or UTF-8 sequence.

If someone produces UTF-16LE and places a U+FEFF at the beginning of it,
then when we encode to UTF-8, we emit only one U+FEFF, which has the
wrong semantics. To be correct here and accept a U+FEFF, we'd need to
check for a U+FEFF at the beginning of a UTF-16LE or UTF-16BE sequence,
encode an extra U+FEFF at the beginning of the UTF-8 data (one for the
BOM and one for the text), and then strip it off again when we decode.
That's kind of ugly, and since iconv doesn't do that itself, we'd have
to; the second sketch at the end outlines that bookkeeping.

-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
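
P.S. Here's a minimal standalone sketch (assuming a POSIX iconv(3); this
is illustrative, not part of the series) for seeing the endianness
difference locally. It converts "a" from UTF-8 to each of the three
encodings and dumps the bytes. Under GNU libiconv, "UTF-16" typically
comes out as fe ff 00 61 (BOM plus big-endian); under glibc, the BOM and
data follow the machine's byte order; "UTF-16LE" and "UTF-16BE" emit no
BOM at all.

#include <iconv.h>
#include <stdio.h>
#include <string.h>

static void convert(const char *to)
{
	char in[] = "a";
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = strlen(in), outleft = sizeof(out);
	iconv_t cd = iconv_open(to, "UTF-8");

	if (cd == (iconv_t)-1) {
		perror("iconv_open");
		return;
	}
	if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
		perror("iconv");
	/* Print whatever bytes the local iconv chose to emit. */
	printf("%-8s:", to);
	for (char *p = out; p < outp; p++)
		printf(" %02x", (unsigned char)*p);
	printf("\n");
	iconv_close(cd);
}

int main(void)
{
	convert("UTF-16");
	convert("UTF-16LE");
	convert("UTF-16BE");
	return 0;
}

Build with "cc bom.c" (add -liconv on systems where libiconv is a
separate library) and compare the output on glibc and libiconv boxes.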
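
P.P.S. And a rough sketch, with hypothetical helper names (nothing like
this exists in Git today), of the extra bookkeeping the "fully correct"
approach would need: notice whether the UTF-16LE/BE input begins with
U+FEFF, and if it does, write an extra U+FEFF (EF BB BF in UTF-8) ahead
of the converted text; the decode path would strip it again before
converting back.

#include <stddef.h>
#include <string.h>

/* Does this UTF-16LE or UTF-16BE buffer begin with U+FEFF? */
int starts_with_feff(const unsigned char *buf, size_t len, int big_endian)
{
	if (len < 2)
		return 0;
	return big_endian ? (buf[0] == 0xfe && buf[1] == 0xff)
			  : (buf[0] == 0xff && buf[1] == 0xfe);
}

/*
 * Copy already-converted UTF-8 into out with an extra U+FEFF in front
 * (one acting as the BOM, one belonging to the text).  Returns the new
 * length, or 0 if out is too small.
 */
size_t prepend_utf8_feff(char *out, size_t outlen, const char *utf8, size_t len)
{
	if (outlen < len + 3)
		return 0;
	memcpy(out, "\xef\xbb\xbf", 3);
	memcpy(out + 3, utf8, len);
	return len + 3;
}

iconv won't do any of this for us, which is exactly why I'd rather not
take it on in Git.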