On Mon, Dec 11, 2017 at 10:50 AM, <lars.schneider@xxxxxxxxxxxx> wrote: > From: Lars Schneider <larsxschneider@xxxxxxxxx> > > Git and its tools (e.g. git diff) expect all text files in UTF-8 > encoding. Git will happily accept content in all other encodings, too, > but it might not be able to process the text (e.g. viewing diffs or > changing line endings). > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the > content to a canonical UTF-8 representation. On checkout Git will > reverse the conversion. > > Reviewed-by: Patrick Lühne <patrick@xxxxxxxxx> > Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx> > --- > diff --git a/convert.c b/convert.c > @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, > +static int encode_to_git(const char *path, const char *src, size_t src_len, > + struct strbuf *buf, struct encoding *enc) > +{ > +#ifndef NO_ICONV > + char *dst, *re_src; > + int dst_len, re_src_len; > + > + /* > + * No encoding is specified or there is nothing to encode. > + * Tell the caller that the content was not modified. > + */ > + if (!enc || (src && !src_len)) > + return 0; > + > + /* > + * Looks like we got called from "would_convert_to_git()". > + * This means Git wants to know if it would encode (= modify!) > + * the content. Let's answer with "yes", since an encoding was > + * specified. > + */ > + if (!buf && !src) > + return 1; > + > + if (enc->to_git == invalid_conversion) { > + enc->to_git = iconv_open(default_encoding, encoding->name); > + if (enc->to_git == invalid_conversion) > + warning(_("unsupported encoding %s"), encoding->name); > + } > + > + if (enc->to_worktree == invalid_conversion) > + enc->to_worktree = iconv_open(encoding->name, default_encoding); Do you need to be calling iconv_close() somewhere on the result of the iconv_open() calls? [Answering myself after reading the rest of the patch: You're caching these opened 'iconv' descriptors, so you don't plan on closing them.] > + [...] > + /* > + * Encode dst back to ensure no information is lost. This wastes > + * a few cycles as most conversions are round trip conversion > + * safe. However, content that has an invalid encoding might not > + * match its original byte sequence after the UTF-8 conversion > + * round trip. Let's play safe here and check the round trip > + * conversion. > + */ > + re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len); > + if (!re_src || strcmp(src, re_src)) { You're using strcmp() as opposed to memcmp() because you expect 're_src' will unconditionally be UTF-8-encoded, right? > + die(_("encoding '%s' from %s to %s and back is not the same"), > + path, enc->name, default_encoding); > + } > + free(re_src); > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +#else > + warning(_("cannot encode '%s' from %s to %s because " > + "your Git was not compiled with encoding support"), > + path, enc->name, default_encoding); > + return 0; > +#endif > +}