On Wed, Jan 30, 2019 at 10:24:44AM -0500, Jason Pyeron wrote:
> > -----Original Message-----
> > From: git-owner@xxxxxxxxxxxxxxx <git-owner@xxxxxxxxxxxxxxx> On Behalf Of
> > tboegi@xxxxxx
> > Sent: Wednesday, January 30, 2019 10:02 AM
> > To: git@xxxxxxxxxxxxxxx; adrigibal@xxxxxxxxx
> > Cc: Torsten Bögershausen <tboegi@xxxxxx>
> > Subject: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
> >
> > From: Torsten Bögershausen <tboegi@xxxxxx>
> >
> > Users who want UTF-16 files in the working tree set the .gitattributes
> > like this:
> >     test.txt working-tree-encoding=UTF-16
> >
> > The unicode standard itself defines 3 allowed ways how to encode UTF-16.
> > The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:
> >
> > a) UTF-16, without BOM, big endian:
> >     $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> >     0000000 g i t
> >
> > b) UTF-16, with BOM, little endian:
> >     $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
> >     0000000 g i t
> >
> > c) UTF-16, with BOM, big endian:
> >     $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> >     0000000 g i t
> >
> > Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
> > working tree.
> > After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
> > in the version (c) above.
> > This is what iconv generates, more details follow below.
> >
> > iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
> >
> > d) UTF-16
> >     $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> >     0000000 376 377 \0 g \0 i \0 t
> >
> > e) UTF-16LE
> >     $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
> >     0000000 g \0 i \0 t \0
> >
> > f) UTF-16BE
> >     $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
> >     0000000 \0 g \0 i \0 t
> >
> > There is no way to generate version (b) from above in a Git working tree,
> > but that is what some applications need.
> > (All fully unicode aware applications should be able to read all 3
> > variants, but in practise we are not there yet).
> >
> > When producing UTF-16 as an output, iconv generates the big endian version
> > with a BOM. (big endian is probably chosen for historical reasons).
> >
> > iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
> > as encoding, and that file does not have a BOM.
> >
> > Not all users (especially under Windows) are happy with this.
> > Some tools are not fully unicode aware and can only handle version (b).
> >
> > Today there is no way to produce version (b) with iconv (or libiconv).
> > Looking into the history of iconv, it seems as if version (c) will
> > be used in all future iconv versions (for compatibility reasons).
> >
> Reading the RFC 2781 section 3.3:
>
>    Text in the "UTF-16BE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in big-endian order.
>    Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.
>
>    Text in the "UTF-16LE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in little-endian order.
>    Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.
>
> I opened a bug with libiconv... https://savannah.gnu.org/bugs/index.php?55609
>

UTF-16 may be a), b) or c) from above.
Every unicode compliant system should be able to read all 3 of them.
When writing, the system/application/converter is free to choose one of those.
Probably for historical reasons, big endian is preferred (in iconv),
and to be helpful to systems/applications a BOM is written at the beginning.
This is in line with the RFC, so why do you think that this is a bug?
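
For anyone who needs variant (b) outside of Git in the meantime, here is a
minimal sketch of one possible workaround. It relies only on what is shown
above: "UTF-16LE" output from iconv carries no BOM, so the two BOM bytes can be
written by hand in front of it. The function name to_utf16le_bom is made up for
this illustration; it is not part of Git or iconv, and this is not necessarily
how the patch implements "UTF-16LE-BOM".

```shell
# Hypothetical helper (name chosen for this example only):
# read UTF-8 on stdin, write UTF-16LE preceded by a BOM on stdout.
to_utf16le_bom() {
    # U+FEFF serialized in little endian order is the byte pair FF FE.
    printf '\377\376' &&
    # iconv's UTF-16LE encoder emits no BOM of its own, see (e) above.
    iconv -f UTF-8 -t UTF-16LE
}

# Dump the result as hex bytes; this should show ff fe 67 00 69 00 74 00
# (exact column spacing depends on the od implementation).
printf 'git' | to_utf16le_bom | od -An -tx1
```

Feeding that output back through "iconv -f UTF-16 -t UTF-8" should give 'git'
again, the same round trip as in (b).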