Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute

Lars Schneider <larsxschneider@xxxxxxxxx> · Sat, 30 Dec 2017 20:58:02 +0100

> On 29 Dec 2017, at 18:28, Torsten Bögershausen <tboegi@xxxxxx> wrote:
> 
> I probably need to look at convert.c more closer, some other comments inline.
> 
> 
> On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@xxxxxxxxxxxx wrote:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
> 
> UTF-8 is too strict, the text from below is more correct:
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.

Agreed. I'll replace it.

>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
> 
> Minor question about "canonical":
> Would this mean the same ?
> ...then Git converts the content into  UTF-8.

I feel the word canonical makes sense in this context to express
that UTF-8 is not some randomly chosen encoding. AFAIK UTF-8 and 
LF are the canonical form for text in Git, no?

>> +In these cases you can teach Git the encoding of a file in the working
>                          teach ? tell ? 

"tell", agreed!

>> +directory with the `checkout-encoding` attribute. If a file with this
>> +attributes is added to Git, then Git reencodes the content from the
>> +specified encoding to UTF-8 and stores the result in its internal data
>> +structure.
> 
> Minor Q:
>> its internal data structure.
> Should we simply write "stores the result in the index" ?

This text is targeted at the end user and I know a lot of end users
that are confused by the word "index". How about:

  ...stores the result in its internal data structure ("the index").

Would that work?

> 
>> On checkout the content is encoded back to the specified
>> +encoding.
> 
>> +
>> +Please note that using the `checkout-encoding` attribute has a number
>> +of drawbacks:
>> +
>> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>> +  errors as the conversion might not be round trip safe.
>> +
>> +- Reencoding content requires resources that might slow down certain
>> +  Git operations (e.g 'git checkout' or 'git add').
>> +
>> +- Git clients that do not support the `checkout-encoding` attribute or
>> +  the used encoding will checkout the respective files as UTF-8 encoded.
>> +  That means the content appears to be different which could cause
>> +  trouble. Affected clients are older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of January 2018).
>> +
>> +Use the `checkout-encoding` attribute only if you cannot store a file in
>> +UTF-8 encoding and if you want Git to be able to process the content as
>> +text.
>> +
> 
> I would maybe rephrase a little bit (first things first):
> 
> Please note that using the `checkout-encoding` attribute may have a number
> of pitfalls:
> 
> - Git clients that do not support the `checkout-encoding` attribute
>  will checkout the respective files as UTF-8 encoded,  which typically
>  causes trouble.
>  Escpecialy when older Git versions are used or alternative Git
>  implementations such as JGit or libgit2 (as of January 2018).
> 
> - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>  errors as the conversion might not be round trip safe.
> 
> - Reencoding content requires resources that might slow down certain
>  Git operations (e.g 'git checkout' or 'git add').
> 
> Use the `checkout-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.

Agreed!

> -----
> Side question: What happens if "the used encoding" is not supported by
> the underlying iconv lib ?
> Will Git fail, delete the file from the working tree ?
> That may be worth to mention. (And I need to do the code-review)

It should checkout the file in UTF-8 and print an error.
If you would try to add a file with an unsupported encoding,
then Git should die() and refuse the operation.
(see t0028: check unsupported encoding)

> 
>> +Use the following attributes if your '*.txt' files are UTF-16 encoded
>> +with byte order mark (BOM) and you want Git to perform automatic line
>> +ending conversion based on your platform.
>> +
>> +------------------------
>> +*.txt		text checkout-encoding=UTF-16
>> +------------------------
>> +
>> +Use the following attributes if your '*.txt' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory.
>> +
>> +------------------------
>> +*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
>> +------------------------
>> +
>> +You can get a list of all available encodings on your platform with the
>> +following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> +
>> `ident`
>> ^^^^^^^
>> 
>> diff --git a/apply.c b/apply.c
>> index 321a9fa68d..c4bd5cf1f2 100644
>> --- a/apply.c
>> +++ b/apply.c
>> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
>> 		 * should never look at the index when explicit crlf option
>> 		 * is given.
>> 		 */
>> -		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
>> +		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
> 
> Hm. Do we really need another parameter here?
> The only caller that needs it (and sets it to 1) is sha1_file.c.
> When an invalid encoding/content is used, it should be safe to die() always?
> Just loud thinking..

Unfortunately it is not safe to die() always. Peff described a situation here
where you could not "checkout away" from an error state:
https://public-inbox.org/git/20171215095838.GA3567@xxxxxxxxxxxxxxxxxxxxx/

> If really needed, it may need less changes in the code base, if a new function
> convert_to_git_or_die() is defined, which is called by convert_to_git()
> 
> (and the other alternative would be to convert safe_crlf into a bitmap,
> see https://public-inbox.org/git/20171229132828.17594-1-tboegi@xxxxxx/T/#u
> what do you think ?)

I still need to review your patch, but a bitmap sounds like a good idea.
But why do we need a special bitmap? Can't we just pass "flags" and
calculate get_safe_crlf() in convert.c?

>> 
>> +
>> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +		const char *error_msg = _(
>> +			"BOM is prohibited for '%s' if encoded as %s");
>> +		const char *advise_msg = _(
>> +			"You told Git to treat '%s' as %s. A byte order mark "
>> +			"(BOM) is prohibited with this encoding. Either use "
>> +			"%.6s as checkout encoding or remove the BOM from the "
>> +			"file.");
>> +
>> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
>> +		if (write_obj)
>> +			die(error_msg, path, enc->name);
>> +		else
>> +			error(error_msg, path, enc->name);
> 
> As said before, I would just die(). Or do I miss something ?
> Being strict with BOMs seams like a good idea.

See above, Peffs test case. We run convert in a lot of places
(e.g. 'git diff' or if checkout away from a bad state).

>> 
>> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
>> new file mode 100755
>> index 0000000000..1a329ab933
>> --- /dev/null
>> +++ b/t/t0028-checkout-encoding.sh
>> @@ -0,0 +1,197 @@
>> +#!/bin/sh
>> +
>> +test_description='checkout-encoding conversion via gitattributes'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup test repo' '
>> +
>> +	text="hallo there!\ncan you read me?" &&
> 
> Is this portable ? (the "\n")

I assume it as it is used in t0001, t0008, t0024 and others.

Thanks for the review,
Lars