Re: [PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM"

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 22 Jan 2019 12:13:48 -0800

tboegi@xxxxxx writes:

> The unicode standard itself defines 3 possible ways how to encode UTF-16.
> a) UTF-16, without BOM, big endian:
> b) UTF-16, with BOM, little endian:
> c) UTF-16, with BOM, big endian:

Is it OK to interpret "possible" as "allowed" above?

> iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
>
> d) UTF-16
> $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> 0000000  376 377  \0   g  \0   i  \0   t

So among three, encoder can only do "big endian with BOM" (c).

Lack of (a) "big endian without BOM" in the encoder is not a problem
in practice, as you can ask UTF-16BE to produce the stream, tell the
decoder that you have UTF-16 and the lack of the BOM would make the
decoder take it as (a).

But lack of (b) "little endian with BOM" is a problem.

So the proposal is to invent UTF-16-[BL]E-BOM that prepends BOM in
front of UTF-16-[BL]E output to allow those who want (b).

Which makes sense, I guess.  I do find it a bit ugly in the sense
that it is something iconv should learn to do, as the issue is
shared with all applications that want to use libiconv and convert
into UTF-16.

Do you add UTF-16-BE-BOM for consistency?  It would be identical to
telling iconv to encode to UTF-16, if I understood your problem
description correctly.

> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index b8392fc330..4a88ab8be7 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
>  ------------------------
>
>  Use the following attributes if your '*.ps1' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> +endian encoded with BOM and you want Git to use Windows line endings
>  in the working directory. Please note, it is highly recommended to
>  explicitly define the line endings with `eol` if the `working-tree-encoding`
>  attribute is used to avoid ambiguity.
>
>  ------------------------
> -*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
> +*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
>  ------------------------

This change is robbing from those who do want a file without BOM to
give to those who do want a file with BOM.  Are the latter class of
people the majority of the intended readers (read: Windows folks)?

I wonder if the following, instead of the above hunk, would work better:

 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory. Please note, it is highly recommended to
+in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+you want UTF-16 little endian with BOM).
+Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`

> @@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
>  {
>  	iconv_t conv;
>  	char *out;
> +	const char *bom_str = NULL;
> +	size_t bom_len = 0;
>
>  	if (!in_encoding)
>  		return NULL;
>
> +	/* UTF-16LE-BOM is the same as UTF-16 for reading */
> +	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
> +		in_encoding = "UTF-16";
> +
> +	/*
> +	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
> +	 * Some users under Windows want the little endian version
> +	 */
> +	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
> +		bom_str = utf16_le_bom;
> +		bom_len = sizeof(utf16_le_bom);
> +		out_encoding = "UTF-16LE";
> +	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";

OK, you do allow BE-BOM and the code does not rely on the fact that
iconv happens to produce it with "UTF-16", because the library is
free to switch between the three possible output (a)-(c) and we do
not want to get affected by such a switch.  Makes sense.