> On 29 Dec 2017, at 18:28, Torsten Bögershausen <tboegi@xxxxxx> wrote: > > I probably need to look at convert.c more closer, some other comments inline. > > > On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@xxxxxxxxxxxx wrote: >> From: Lars Schneider <larsxschneider@xxxxxxxxx> >> >> Git and its tools (e.g. git diff) expect all text files in UTF-8 >> encoding. Git will happily accept content in all other encodings, too, >> but it might not be able to process the text (e.g. viewing diffs or >> changing line endings). > > UTF-8 is too strict, the text from below is more correct: > +Git recognizes files encoded with ASCII or one of its supersets (e.g. > +UTF-8 or ISO-8859-1) as text files. All other encodings are usually > +interpreted as binary and consequently built-in Git text processing > +tools (e.g. 'git diff') as well as most Git web front ends do not > +visualize the content. Agreed. I'll replace it. >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the >> content to a canonical UTF-8 representation. On checkout Git will > > Minor question about "canonical": > Would this mean the same ? > ...then Git converts the content into UTF-8. I feel the word canonical makes sense in this context to express that UTF-8 is not some randomly chosen encoding. AFAIK UTF-8 and LF are the canonical form for text in Git, no? >> +In these cases you can teach Git the encoding of a file in the working > teach ? tell ? "tell", agreed! >> +directory with the `checkout-encoding` attribute. If a file with this >> +attributes is added to Git, then Git reencodes the content from the >> +specified encoding to UTF-8 and stores the result in its internal data >> +structure. > > Minor Q: >> its internal data structure. > Should we simply write "stores the result in the index" ? This text is targeted at the end user and I know a lot of end users that are confused by the word "index". How about: ...stores the result in its internal data structure ("the index"). Would that work? > >> On checkout the content is encoded back to the specified >> +encoding. > >> + >> +Please note that using the `checkout-encoding` attribute has a number >> +of drawbacks: >> + >> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause >> + errors as the conversion might not be round trip safe. >> + >> +- Reencoding content requires resources that might slow down certain >> + Git operations (e.g 'git checkout' or 'git add'). >> + >> +- Git clients that do not support the `checkout-encoding` attribute or >> + the used encoding will checkout the respective files as UTF-8 encoded. >> + That means the content appears to be different which could cause >> + trouble. Affected clients are older Git versions and alternative Git >> + implementations such as JGit or libgit2 (as of January 2018). >> + >> +Use the `checkout-encoding` attribute only if you cannot store a file in >> +UTF-8 encoding and if you want Git to be able to process the content as >> +text. >> + > > I would maybe rephrase a little bit (first things first): > > Please note that using the `checkout-encoding` attribute may have a number > of pitfalls: > > - Git clients that do not support the `checkout-encoding` attribute > will checkout the respective files as UTF-8 encoded, which typically > causes trouble. > Escpecialy when older Git versions are used or alternative Git > implementations such as JGit or libgit2 (as of January 2018). > > - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause > errors as the conversion might not be round trip safe. > > - Reencoding content requires resources that might slow down certain > Git operations (e.g 'git checkout' or 'git add'). > > Use the `checkout-encoding` attribute only if you cannot store a file in > UTF-8 encoding and if you want Git to be able to process the content as > text. Agreed! > ----- > Side question: What happens if "the used encoding" is not supported by > the underlying iconv lib ? > Will Git fail, delete the file from the working tree ? > That may be worth to mention. (And I need to do the code-review) It should checkout the file in UTF-8 and print an error. If you would try to add a file with an unsupported encoding, then Git should die() and refuse the operation. (see t0028: check unsupported encoding) > >> +Use the following attributes if your '*.txt' files are UTF-16 encoded >> +with byte order mark (BOM) and you want Git to perform automatic line >> +ending conversion based on your platform. >> + >> +------------------------ >> +*.txt text checkout-encoding=UTF-16 >> +------------------------ >> + >> +Use the following attributes if your '*.txt' files are UTF-16 little >> +endian encoded without BOM and you want Git to use Windows line endings >> +in the working directory. >> + >> +------------------------ >> +*.txt checkout-encoding=UTF-16LE text eol=CRLF >> +------------------------ >> + >> +You can get a list of all available encodings on your platform with the >> +following command: >> + >> +------------------------ >> +iconv --list >> +------------------------ >> + >> + >> `ident` >> ^^^^^^^ >> >> diff --git a/apply.c b/apply.c >> index 321a9fa68d..c4bd5cf1f2 100644 >> --- a/apply.c >> +++ b/apply.c >> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch, >> * should never look at the index when explicit crlf option >> * is given. >> */ >> - convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf); >> + convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0); > > Hm. Do we really need another parameter here? > The only caller that needs it (and sets it to 1) is sha1_file.c. > When an invalid encoding/content is used, it should be safe to die() always? > Just loud thinking.. Unfortunately it is not safe to die() always. Peff described a situation here where you could not "checkout away" from an error state: https://public-inbox.org/git/20171215095838.GA3567@xxxxxxxxxxxxxxxxxxxxx/ > If really needed, it may need less changes in the code base, if a new function > convert_to_git_or_die() is defined, which is called by convert_to_git() > > (and the other alternative would be to convert safe_crlf into a bitmap, > see https://public-inbox.org/git/20171229132828.17594-1-tboegi@xxxxxx/T/#u > what do you think ?) I still need to review your patch, but a bitmap sounds like a good idea. But why do we need a special bitmap? Can't we just pass "flags" and calculate get_safe_crlf() in convert.c? >> >> + >> + if (has_prohibited_utf_bom(enc->name, src, src_len)) { >> + const char *error_msg = _( >> + "BOM is prohibited for '%s' if encoded as %s"); >> + const char *advise_msg = _( >> + "You told Git to treat '%s' as %s. A byte order mark " >> + "(BOM) is prohibited with this encoding. Either use " >> + "%.6s as checkout encoding or remove the BOM from the " >> + "file."); >> + >> + advise(advise_msg, path, enc->name, enc->name, enc->name); >> + if (write_obj) >> + die(error_msg, path, enc->name); >> + else >> + error(error_msg, path, enc->name); > > As said before, I would just die(). Or do I miss something ? > Being strict with BOMs seams like a good idea. See above, Peffs test case. We run convert in a lot of places (e.g. 'git diff' or if checkout away from a bad state). >> >> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh >> new file mode 100755 >> index 0000000000..1a329ab933 >> --- /dev/null >> +++ b/t/t0028-checkout-encoding.sh >> @@ -0,0 +1,197 @@ >> +#!/bin/sh >> + >> +test_description='checkout-encoding conversion via gitattributes' >> + >> +. ./test-lib.sh >> + >> +test_expect_success 'setup test repo' ' >> + >> + text="hallo there!\ncan you read me?" && > > Is this portable ? (the "\n") I assume it as it is used in t0001, t0008, t0024 and others. Thanks for the review, Lars