> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@xxxxxx> wrote: > > On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@xxxxxxxxxxxx wrote: >> From: Lars Schneider <larsxschneider@xxxxxxxxx> >> >> +`encoding` >> +^^^^^^^^^^ >> + >> +By default Git assumes UTF-8 encoding for text files. The `encoding` > > This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. I am not sure I understand what you want to tell me here. Do you think UTF-8 encoding is not correct in the sentence above? > >> +attribute sets the encoding to be used in the working directory. >> +If the path is added to the index, then Git encodes the content to >> +UTF-8. On checkout the content is encoded back to the original >> +encoding. Consequently, you can use all built-in Git text processing >> +tools (e.g. git diff, line ending conversions, etc.) with your >> +non-UTF-8 encoded file. >> + >> +Please note that re-encoding content can cause errors and requires >> +resources. Use the `encoding` attribute only if you cannot store >> +a file in UTF-8 encoding and if you want Git to be able to process >> +the text. >> + >> +------------------------ >> +*.txt text encoding=UTF-16 >> +------------------------ > > I think that encoding (or worktree-encoding as said elsewhere) must imply text. > (That is not done in convert.c, but assuming binary or even auto > does not make much sense to me) "text" only means "LF". "-text" means CRLF which would be perfectly fine for UTF-16. Therefore I don't think we need to imply text. Does this make sense? > > >> + >> +All `iconv` encodings with a stable round-trip conversion to and from >> +UTF-8 are supported. You can see a full list with the following command: >> + >> +------------------------ >> +iconv --list >> +------------------------ >> + >> `eol` >> ^^^^^ >> >> diff --git a/convert.c b/convert.c >> index 20d7ab67bd..ee19c17104 100644 >> --- a/convert.c >> +++ b/convert.c >> @@ -7,6 +7,7 @@ >> #include "sigchain.h" >> #include "pkt-line.h" >> #include "sub-process.h" >> +#include "utf8.h" >> >> /* >> * convert.c - convert a file when checking it out and checking it in. >> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, >> >> } > > I would avoid to use these #ifdefs here. > All functions can be in strbuf.c (and may have #ifdefs there), but not here. I'll try that. But wouldn't it make more sense to move the functions to utf.c? > >> >> +#ifdef NO_ICONV >> +#ifndef _ICONV_T >> +/* The type is just a placeholder and not actually used. */ >> +typedef void* iconv_t; >> +#endif >> +#endif >> + >> +static struct encoding { >> + const char *name; >> + iconv_t to_git; /* conversion to Git's canonical encoding (UTF-8) */ >> + iconv_t to_worktree; /* conversion to user-defined encoding */ >> + struct encoding *next; >> +} *encoding, **encoding_tail; > > This seems like an optimazation, assuning that iconv_open() eats a lot of ressources. > I don't think this is the case. (Undere MacOS we run iconv_open() together with > every opendir(), and optimizing this out did not show any measurable improvements) True, but then I would need to free() the memory in a lot of places. Therefore I thought it is easier to keep the object. OK for you? >> +static const char *default_encoding = "utf-8"; > > The most portable form is "UTF-8" (correct me if that is wrong) It shouldn't matter. But I've changed it to uppercase to be on the safe side. >> +static iconv_t invalid_conversion = (iconv_t)-1; >> + >> +static int encode_to_git(const char *path, const char *src, size_t src_len, >> + struct strbuf *buf, struct encoding *enc) >> +{ >> +#ifndef NO_ICONV >> + char *dst, *re_src; >> + int dst_len, re_src_len; >> + >> + /* >> + * No encoding is specified or there is nothing to encode. >> + * Tell the caller that the content was not modified. >> + */ >> + if (!enc || (src && !src_len)) >> + return 0; >> + >> + /* >> + * Looks like we got called from "would_convert_to_git()". >> + * This means Git wants to know if it would encode (= modify!) >> + * the content. Let's answer with "yes", since an encoding was >> + * specified. >> + */ >> + if (!buf && !src) >> + return 1; >> + >> + if (enc->to_git == invalid_conversion) { >> + enc->to_git = iconv_open(default_encoding, encoding->name); >> + if (enc->to_git == invalid_conversion) >> + warning(_("unsupported encoding %s"), encoding->name); >> + } > > /* There are 2 different types of reaction: > Either users know what that a warning means: You asked for problems, > and do the right thing. Other may may ignore the warning > - in both cases an error is useful */ Agreed! >> + if (enc->to_worktree == invalid_conversion) >> + enc->to_worktree = iconv_open(encoding->name, default_encoding); >> + >> + /* >> + * Do nothing if the encoding is not supported. This could happen in >> + * two cases: >> + * (1) The encoding is garbage. That is no problem as we would not >> + * encode the content to UTF-8 on "add" and we would not encode >> + * it back on "checkout". >> + * (2) Git users use different iconv versions that support different >> + * encodings. This could lead to problems, as the content might >> + * not be encoded on "add" but encoded back on "checkout" (or >> + * the other way around). >> + * We print a one-time warning to the user in both cases above. >> + */ > > Isn't an error better than "garbage in -> garbage out" ? Agreed. I changed the warning to an error. >> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh > > (I didn't review the test yet) > > Another comment for a possible improvement: > "git diff" could be told to write the worktree-encoding into the diff, > and "git apply" can pick that up. Yes, we could do that. However, I would tackle that in a separate series. Thanks, Lars