On 27.12.18 at 03:17, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be suitable for certain Windows programs. In an effort to communicate the reasons for our behavior more effectively, explain in the documentation that the UTF-16 variant that people have been asking for hasn't been standardized, and therefore hasn't been implemented in iconv(3). Mention what each of the variants does, so that people can decide which one best meets their needs.
>
> In addition, add a comment in the code about why we must, for correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins with U+FEFF: such a code point semantically represents a ZWNBSP, not a BOM, but that code point at the beginning of a UTF-8 sequence (as encoded in the object store) would be misinterpreted as a BOM instead.
>
> This comment is in the code because I think it needs to be somewhere, but I'm not sure the documentation is the right place for it. If desired, I can add it to the documentation, although I feel the gory details are not interesting to most users. If the wording is confusing, I'm very open to suggestions for how to improve it.
>
> I don't use Windows, so I don't know what MSVCRT does. If it requires a BOM but doesn't accept a big-endian encoding, then perhaps we should report that as a bug to Microsoft so it can be fixed in a future version. That would probably make a lot more programs work right out of the box and dramatically improve the user experience.
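To make the difference concrete, here is a minimal sketch (not Git
code) of what the three names mean to iconv(3). The byte sequences in
the comments are what glibc produces on a little-endian machine; other
iconv implementations may choose a different byte order for plain
"UTF-16":

#include <iconv.h>
#include <stdio.h>
#include <string.h>

static void convert(const char *to)
{
	char in[] = "A";		/* one ASCII character */
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = strlen(in), outleft = sizeof(out);
	iconv_t cd = iconv_open(to, "UTF-8");

	if (cd == (iconv_t)-1)
		return;
	iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);

	printf("%-8s:", to);
	for (char *p = out; p < outp; p++)
		printf(" %02x", (unsigned char)*p);
	printf("\n");
}

int main(void)
{
	convert("UTF-16");	/* BOM, byte order up to the converter:
				   ff fe 41 00 here */
	convert("UTF-16LE");	/* no BOM, little-endian: 41 00 */
	convert("UTF-16BE");	/* no BOM, big-endian: 00 41 */
	return 0;
}

None of these spellings means "little-endian with a BOM", which
appears to be the variant the reports ask for.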
It worries me that theoretical correctness is valued more highly than existing practice. I do not care much what some RFC says programs should do if the majority of software does something different and that behavior has proven useful in practice.
My understanding is that there is no such thing as a "byte order marker". It just so happens that when a UTF-16 text file begins with a ZWNBSP, the endianness of the file can be derived automatically. Other than that, that very first code point U+FEFF *is part of the data* and must not be removed when the data is re-encoded. If Git does something different, it is bogus, IMO.
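As a sketch of that re-encoding concern (again assuming glibc's
iconv(3)): a UTF-16LE stream whose first code point is U+FEFF,
converted faithfully, yields ef bb bf at the front of the UTF-8
result, where a consumer can no longer tell data from BOM:

#include <iconv.h>
#include <stdio.h>

int main(void)
{
	/* UTF-16LE: U+FEFF (a ZWNBSP, part of the data), then U+0041 */
	char in[] = { '\xff', '\xfe', 'A', '\0' };
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = sizeof(in), outleft = sizeof(out);
	iconv_t cd = iconv_open("UTF-8", "UTF-16LE");

	if (cd == (iconv_t)-1)
		return 1;
	iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);

	/* prints "ef bb bf 41": U+FEFF is kept, and its UTF-8 form is
	   byte-identical to what tools strip as a "UTF-8 BOM" */
	for (char *p = out; p < outp; p++)
		printf("%02x ", (unsigned char)*p);
	printf("\n");
	return 0;
}

The conversion itself preserves the code point; whether the ef bb bf
is then read back as data or as a BOM is entirely up to the consumer.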
-- Hannes