On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@xxxxxxxxxxxx wrote: > From: Lars Schneider <larsxschneider@xxxxxxxxx> > > Git and its tools (e.g. git diff) expect all text files in UTF-8 > encoding. Git will happily accept content in all other encodings, too, > but it might not be able to process the text (e.g. viewing diffs or > changing line endings). > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the > content to a canonical UTF-8 representation. On checkout Git will > reverse the conversion. > > Reviewed-by: Patrick Lühne <patrick@xxxxxxxxx> > Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx> > --- > > Hi, > > here is a WIP patch to add text encoding support for files encoded with > something other than UTF-8 [RFC]. > > The 'encoding' attribute is already used to view blobs in gitk. That > could be a problem as the content is stored in Git with the defined > encoding. This patch would interpret the content as UTF-8 encoded and > it would try to reencode it to the defined encoding on checkout. > Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251"). > These folks would see errors like these with my patch: > error: failed to encode 'foo.bar' from utf-8 to cp1251 > > A quick search on GitHub reveals 2,722 repositories that use the > 'encoding' attribute [1]. Using the GitHub API [2], I identified the > following encodings in all publicly accessible repositories: > > ANSI <-- garbage (ignore by my implementation) > cp1251 > cp866 > cp950 > iso8859-1 > koi8-r > shiftjis <-- garbage (ignore by my implementation) > UTF-8 <-- unnecessary (ignore by my implementation) > utf8 <-- garbage (ignore by my implementation) > > TODOs: > - The iconv docs mention that "errno == EINVAL" is harmless but > we don't handle that case in utf8.c:reencode_string_iconv() > - Git does not compile with NO_ICONV=1 right now because of > compat/precompose_utf8.c. I will send a patch to fix that. Minor: Does Git not compile under MacOS when setting NO_ICONV=1 ? This is possible not introduced by the patch. > > Questions: > - Should I use warning() or error() in the patch? > (currently I use the warning) Errors, see below. > - Do you agree with the approach in general? Yes, some comments inline. > > Thanks, > Lars > > > [RFC] http://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@xxxxxxxxx > [1] https://github.com/search?p=1&q=encoding+filename%3Agitattributes&type=Code&utf8=%E2%9C%93 > [2] curl --user larsxschneider:secret -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq > [3] https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples > > > > > Notes: > Base Ref: master > Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d > Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && git checkout afc9e88a4d > > Documentation/gitattributes.txt | 27 ++++++ > convert.c | 192 +++++++++++++++++++++++++++++++++++++++- > t/t0028-encoding.sh | 60 +++++++++++++ > 3 files changed, 278 insertions(+), 1 deletion(-) > create mode 100755 t/t0028-encoding.sh > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index 30687de81a..84de2fa49c 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -146,6 +146,33 @@ Unspecified:: > Any other value causes Git to act as if `text` has been left > unspecified. > > +`encoding` > +^^^^^^^^^^ > + > +By default Git assumes UTF-8 encoding for text files. The `encoding` This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. > +attribute sets the encoding to be used in the working directory. > +If the path is added to the index, then Git encodes the content to > +UTF-8. On checkout the content is encoded back to the original > +encoding. Consequently, you can use all built-in Git text processing > +tools (e.g. git diff, line ending conversions, etc.) with your > +non-UTF-8 encoded file. > + > +Please note that re-encoding content can cause errors and requires > +resources. Use the `encoding` attribute only if you cannot store > +a file in UTF-8 encoding and if you want Git to be able to process > +the text. > + > +------------------------ > +*.txt text encoding=UTF-16 > +------------------------ I think that encoding (or worktree-encoding as said elsewhere) must imply text. (That is not done in convert.c, but assuming binary or even auto does not make much sense to me) > + > +All `iconv` encodings with a stable round-trip conversion to and from > +UTF-8 are supported. You can see a full list with the following command: > + > +------------------------ > +iconv --list > +------------------------ > + > `eol` > ^^^^^ > > diff --git a/convert.c b/convert.c > index 20d7ab67bd..ee19c17104 100644 > --- a/convert.c > +++ b/convert.c > @@ -7,6 +7,7 @@ > #include "sigchain.h" > #include "pkt-line.h" > #include "sub-process.h" > +#include "utf8.h" > > /* > * convert.c - convert a file when checking it out and checking it in. > @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, > > } I would avoid to use these #ifdefs here. All functions can be in strbuf.c (and may have #ifdefs there), but not here. > > +#ifdef NO_ICONV > +#ifndef _ICONV_T > +/* The type is just a placeholder and not actually used. */ > +typedef void* iconv_t; > +#endif > +#endif > + > +static struct encoding { > + const char *name; > + iconv_t to_git; /* conversion to Git's canonical encoding (UTF-8) */ > + iconv_t to_worktree; /* conversion to user-defined encoding */ > + struct encoding *next; > +} *encoding, **encoding_tail; This seems like an optimazation, assuning that iconv_open() eats a lot of ressources. I don't think this is the case. (Undere MacOS we run iconv_open() together with every opendir(), and optimizing this out did not show any measurable improvements) > +static const char *default_encoding = "utf-8"; The most portable form is "UTF-8" (correct me if that is wrong) > +static iconv_t invalid_conversion = (iconv_t)-1; > + > +static int encode_to_git(const char *path, const char *src, size_t src_len, > + struct strbuf *buf, struct encoding *enc) > +{ > +#ifndef NO_ICONV > + char *dst, *re_src; > + int dst_len, re_src_len; > + > + /* > + * No encoding is specified or there is nothing to encode. > + * Tell the caller that the content was not modified. > + */ > + if (!enc || (src && !src_len)) > + return 0; > + > + /* > + * Looks like we got called from "would_convert_to_git()". > + * This means Git wants to know if it would encode (= modify!) > + * the content. Let's answer with "yes", since an encoding was > + * specified. > + */ > + if (!buf && !src) > + return 1; > + > + if (enc->to_git == invalid_conversion) { > + enc->to_git = iconv_open(default_encoding, encoding->name); > + if (enc->to_git == invalid_conversion) > + warning(_("unsupported encoding %s"), encoding->name); > + } /* There are 2 different types of reaction: Either users know what that a warning means: You asked for problems, and do the right thing. Other may may ignore the warning - in both cases an error is useful */ > + > + if (enc->to_worktree == invalid_conversion) > + enc->to_worktree = iconv_open(encoding->name, default_encoding); > + > + /* > + * Do nothing if the encoding is not supported. This could happen in > + * two cases: > + * (1) The encoding is garbage. That is no problem as we would not > + * encode the content to UTF-8 on "add" and we would not encode > + * it back on "checkout". > + * (2) Git users use different iconv versions that support different > + * encodings. This could lead to problems, as the content might > + * not be encoded on "add" but encoded back on "checkout" (or > + * the other way around). > + * We print a one-time warning to the user in both cases above. > + */ Isn't an error better than "garbage in -> garbage out" ? > + if (enc->to_git == invalid_conversion || enc->to_worktree == invalid_conversion) > + return 0; > + > + dst = reencode_string_iconv(src, src_len, enc->to_git, &dst_len); I would suggest to write a function in strbuf and use it here: int strbuf_add_reencode((struct strbuf *sb, const void *data, size_t len, const char *from, const char *to) > + if (!dst) > + /* > + * We could add the blob "as-is" to Git. However, on checkout > + * we would try to reencode to the original encoding. This > + * would fail and we would leave the user with a messed-up > + * working tree. Let's try to avoid this by screaming loud. > + */ > + die(_("failed to encode '%s' from %s to %s"), > + path, enc->name, default_encoding); > + > + /* > + * Encode dst back to ensure no information is lost. This wastes "wastes" ? "uses" i would say. > + * a few cycles as most conversions are round trip conversion > + * safe. However, content that has an invalid encoding might not > + * match its original byte sequence after the UTF-8 conversion > + * round trip. Let's play safe here and check the round trip > + * conversion. > + */ > + re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len); > + if (!re_src || strcmp(src, re_src)) { > + die(_("encoding '%s' from %s to %s and back is not the same"), > + path, enc->name, default_encoding); > + } > + free(re_src); > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +#else > + warning(_("cannot encode '%s' from %s to %s because " > + "your Git was not compiled with encoding support"), > + path, enc->name, default_encoding); > + return 0; > +#endif > +} > + > +static int encode_to_worktree(const char *path, const char *src, size_t src_len, > + struct strbuf *buf, struct encoding *enc) Similar here: Better to use a (new) function from strbuf. > +{ > +#ifndef NO_ICONV > + char *dst; > + int dst_len; > + > + /* > + * No encoding is specified or there is nothing to encode. > + * Tell the caller that the content was not modified. > + */ > + if (!enc || (src && !src_len)) > + return 0; > + > + if (enc->to_worktree == invalid_conversion) { > + enc->to_worktree = iconv_open(encoding->name, default_encoding); > + if (enc->to_worktree == invalid_conversion) > + warning("unsupported encoding %s", encoding->name); > + } > + > + /* > + * Do nothing if the encoding is not supported. > + * See detailed explanation in encode_to_git(). > + */ > + if (enc->to_worktree == invalid_conversion) > + return 0; > + > + dst = reencode_string_iconv(src, src_len, enc->to_worktree, &dst_len); > + if (!dst) { > + warning("failed to encode '%s' from %s to %s", > + path, default_encoding, encoding->name); > + return 0; > + } > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +#else > + warning(_("cannot encode '%s' from %s to %s because " > + "your Git was not compiled with encoding support"), > + path, default_encoding, encoding->name); > + return 0; > +#endif > +} > + > static int crlf_to_git(const struct index_state *istate, > const char *path, const char *src, size_t len, > struct strbuf *buf, > @@ -969,6 +1113,32 @@ static int ident_to_worktree(const char *path, const char *src, size_t len, > return 1; > } > > +static struct encoding *git_path_check_encoding(struct attr_check_item *check) > +{ > + const char *value = check->value; > + struct encoding *enc; > + > + if (ATTR_UNSET(value)) > + return NULL; > + > + for (enc = encoding; enc; enc = enc->next) > + if (!strcmp(value, enc->name)) > + return enc; > + > + /* Don't encode to the default encoding */ > + if (!strcasecmp(value, default_encoding)) > + return NULL; > + > + enc = xcalloc(1, sizeof(struct convert_driver)); > + enc->name = xstrdup(value); > + enc->to_git = invalid_conversion; > + enc->to_worktree = invalid_conversion; > + *encoding_tail = enc; > + encoding_tail = &(enc->next); > + > + return enc; > +} > + > static enum crlf_action git_path_check_crlf(struct attr_check_item *check) > { > const char *value = check->value; > @@ -1024,6 +1194,7 @@ struct conv_attrs { > enum crlf_action attr_action; /* What attr says */ > enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */ > int ident; > + struct encoding *encoding; /* Supported encoding or default encoding if NULL */ > }; > > static void convert_attrs(struct conv_attrs *ca, const char *path) > @@ -1032,8 +1203,9 @@ static void convert_attrs(struct conv_attrs *ca, const char *path) > > if (!check) { > check = attr_check_initl("crlf", "ident", "filter", > - "eol", "text", NULL); > + "eol", "text", "encoding", NULL); > user_convert_tail = &user_convert; > + encoding_tail = &encoding; > git_config(read_convert_config, NULL); > } As written before, "worktree-encoding" should imply text. > > @@ -1055,6 +1227,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path) > else if (eol_attr == EOL_CRLF) > ca->crlf_action = CRLF_TEXT_CRLF; > } > + ca->encoding = git_path_check_encoding(ccheck + 5); > } else { > ca->drv = NULL; > ca->crlf_action = CRLF_UNDEFINED; > @@ -1135,6 +1308,13 @@ int convert_to_git(const struct index_state *istate, > src = dst->buf; > len = dst->len; > } > + > + ret |= encode_to_git(path, src, len, dst, ca.encoding); > + if (ret && dst) { > + src = dst->buf; > + len = dst->len; > + } > + > if (checksafe != SAFE_CRLF_KEEP_CRLF) { > ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe); > if (ret && dst) { > @@ -1158,6 +1338,7 @@ void convert_to_git_filter_fd(const struct index_state *istate, > if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL)) > die("%s: clean filter '%s' failed", path, ca.drv->name); > > + encode_to_git(path, dst->buf, dst->len, dst, ca.encoding); > crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe); > ident_to_git(path, dst->buf, dst->len, dst, ca.ident); > } > @@ -1189,6 +1370,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src, > } > } > > + ret |= encode_to_worktree(path, src, len, dst, ca.encoding); > + if (ret) { > + src = dst->buf; > + len = dst->len; > + } > + > ret_filter = apply_filter( > path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco); > if (!ret_filter && ca.drv && ca.drv->required) > @@ -1655,6 +1842,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s > if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean)) > return NULL; > > + if (ca.encoding) > + return NULL; > + > if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF) > return NULL; > > diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh (I didn't review the test yet) Another comment for a possible improvement: "git diff" could be told to write the worktree-encoding into the diff, and "git apply" can pick that up.