tboegi@xxxxxx writes:

> The unicode standard itself defines 3 possible ways how to encode UTF-16.
> a) UTF-16, without BOM, big endian:
> b) UTF-16, with BOM, little endian:
> c) UTF-16, with BOM, big endian:

Is it OK to interpret "possible" as "allowed" above?

> iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
>
> d) UTF-16
> $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> 0000000 376 377 \0 g \0 i \0 t

So among the three, the encoder can only do "big endian with BOM" (c).
Lack of (a) "big endian without BOM" in the encoder is not a problem
in practice, as you can ask for UTF-16BE to produce the stream, tell
the decoder that you have UTF-16, and the lack of the BOM would make
the decoder take it as (a).  But lack of (b) "little endian with BOM"
is a problem.

So the proposal is to invent UTF-16[BL]E-BOM that prepends a BOM in
front of UTF-16[BL]E output to allow those who want (b).  Which makes
sense, I guess.  I do find it a bit ugly in the sense that it is
something iconv should learn to do, as the issue is shared with all
applications that want to use libiconv and convert into UTF-16.

Do you add UTF-16BE-BOM for consistency?  It would be identical to
telling iconv to encode to UTF-16, if I understood your problem
description correctly.

> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index b8392fc330..4a88ab8be7 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
>  ------------------------
>
>  Use the following attributes if your '*.ps1' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> +endian encoded with BOM and you want Git to use Windows line endings
>  in the working directory. Please note, it is highly recommended to
>  explicitly define the line endings with `eol` if the `working-tree-encoding`
>  attribute is used to avoid ambiguity.
>
>  ------------------------
> -*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF
> +*.ps1 text working-tree-encoding=UTF-16LE-BOM eol=CRLF
>  ------------------------

This change is robbing from those who do want a file without BOM to
give to those who do want a file with BOM.  Are the latter class of
people the majority of the intended readers (read: Windows folks)?

I wonder if the following, instead of the above hunk, would work
better:

     endian encoded without BOM and you want Git to use Windows line endings
    -in the working directory. Please note, it is highly recommended to
    +in the working directory (use `UTF-16LE-BOM` instead of `UTF-16LE` if
    +you want UTF-16 little endian with BOM).
    +Please note, it is highly recommended to
     explicitly define the line endings with `eol` if the `working-tree-encoding`

> @@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
>  {
>  	iconv_t conv;
>  	char *out;
> +	const char *bom_str = NULL;
> +	size_t bom_len = 0;
>
>  	if (!in_encoding)
>  		return NULL;
>
> +	/* UTF-16LE-BOM is the same as UTF-16 for reading */
> +	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
> +		in_encoding = "UTF-16";
> +
> +	/*
> +	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
> +	 * Some users under Windows want the little endian version
> +	 */
> +	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
> +		bom_str = utf16_le_bom;
> +		bom_len = sizeof(utf16_le_bom);
> +		out_encoding = "UTF-16LE";
> +	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";

OK, you do allow BE-BOM and the code does not rely on the fact that
iconv happens to produce it with "UTF-16", because the library is
free to switch between the three possible outputs (a)-(c) and we do
not want to get affected by such a switch.

Makes sense.
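
To make the (a)-(c) distinction concrete, here is a minimal standalone
sketch in C.  It is not the code from the patch and does not call
iconv; the helper names detect_utf16_order() and prepend_utf16_bom()
are made up for this illustration.  It assumes the reading discussed
above: a decoder told "UTF-16" follows the BOM when there is one and
falls back to big endian when there is none, and a "-BOM" variant is
simply the fixed-endian output with the matching BOM placed in front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* U+FEFF encoded in each byte order. */
static const unsigned char utf16_be_bom[] = { 0xFE, 0xFF };
static const unsigned char utf16_le_bom[] = { 0xFF, 0xFE };

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/*
 * Hypothetical helper: decide how a stream labelled plain "UTF-16"
 * would be read.  A BOM, if present, decides the byte order; without
 * one we assume big endian, i.e. case (a).  The number of BOM bytes
 * to skip is stored in *skip.
 */
enum utf16_order detect_utf16_order(const unsigned char *buf, size_t len,
				    size_t *skip)
{
	*skip = 0;
	if (len >= 2 && !memcmp(buf, utf16_le_bom, 2)) {
		*skip = 2;
		return UTF16_LITTLE_ENDIAN;	/* case (b) */
	}
	if (len >= 2 && !memcmp(buf, utf16_be_bom, 2))
		*skip = 2;			/* case (c) */
	return UTF16_BIG_ENDIAN;		/* case (a) or (c) */
}

/*
 * Hypothetical helper: put the BOM for "order" in front of data that
 * is already encoded as UTF-16LE or UTF-16BE.  Returns a malloc()ed
 * buffer of len + 2 bytes (the caller frees it), or NULL if the
 * allocation fails.
 */
unsigned char *prepend_utf16_bom(enum utf16_order order,
				 const unsigned char *in, size_t len,
				 size_t *outlen)
{
	const unsigned char *bom =
		order == UTF16_LITTLE_ENDIAN ? utf16_le_bom : utf16_be_bom;
	unsigned char *out;

	assert(in || !len);
	out = malloc(len + 2);
	if (!out)
		return NULL;
	memcpy(out, bom, 2);
	if (len)
		memcpy(out + 2, in, len);
	*outlen = len + 2;
	return out;
}
```

Feeding the UTF-16BE bytes of 'git' to prepend_utf16_bom() with
UTF16_BIG_ENDIAN gives the same eight bytes as the od output quoted
above, and the little endian variant gives (b), which is the case
plain "UTF-16" does not produce in that example.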