Re: [PATCH v1] convert: add support for 'encoding' attribute

Lars Schneider <larsxschneider@xxxxxxxxx> · Tue, 12 Dec 2017 00:47:59 +0100

On 11 Dec 2017, at 19:39, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:

> On Mon, Dec 11, 2017 at 10:50 AM,  <lars.schneider@xxxxxxxxxxxx> wrote:
>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Reviewed-by: Patrick Lühne <patrick@xxxxxxxxx>
>> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +                        struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +       char *dst, *re_src;
>> +       int dst_len, re_src_len;
>> +
>> +       /*
>> +        * No encoding is specified or there is nothing to encode.
>> +        * Tell the caller that the content was not modified.
>> +        */
>> +       if (!enc || (src && !src_len))
>> +               return 0;
>> +
>> +       /*
>> +        * Looks like we got called from "would_convert_to_git()".
>> +        * This means Git wants to know if it would encode (= modify!)
>> +        * the content. Let's answer with "yes", since an encoding was
>> +        * specified.
>> +        */
>> +       if (!buf && !src)
>> +               return 1;
>> +
>> +       if (enc->to_git == invalid_conversion) {
>> +               enc->to_git = iconv_open(default_encoding, encoding->name);
>> +               if (enc->to_git == invalid_conversion)
>> +                       warning(_("unsupported encoding %s"), encoding->name);
>> +       }
>> +
>> +       if (enc->to_worktree == invalid_conversion)
>> +               enc->to_worktree = iconv_open(encoding->name, default_encoding);
> 
> Do you need to be calling iconv_close() somewhere on the result of the
> iconv_open() calls? [Answering myself after reading the rest of the
> patch: You're caching these opened 'iconv' descriptors, so you don't
> plan on closing them.]

Should this information go into the commit message to avoid confusing
future readers? I think, yes.

>> + [...]
>> +       /*
>> +        * Encode dst back to ensure no information is lost. This wastes
>> +        * a few cycles as most conversions are round trip conversion
>> +        * safe. However, content that has an invalid encoding might not
>> +        * match its original byte sequence after the UTF-8 conversion
>> +        * round trip. Let's play safe here and check the round trip
>> +        * conversion.
>> +        */
>> +       re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
>> +       if (!re_src || strcmp(src, re_src)) {
> 
> You're using strcmp() as opposed to memcmp() because you expect
> 're_src' will unconditionally be UTF-8-encoded, right?

Yes. But since you mention it, I think it would be better to use
memcmp() here! Plus, it might be a tiny bit faster ;-)

Thanks,
Lars