> On 29 Dec 2017, at 13:59, Torsten Bögershausen <tboegi@xxxxxx> wrote: > > On 2017-12-28 17:14, Lars Schneider wrote:> >>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@xxxxxx> wrote: >>> >>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@xxxxxxxxxxxx wrote: >>>> From: Lars Schneider <larsxschneider@xxxxxxxxx> >>>> >> >>>> +`encoding` >>>> +^^^^^^^^^^ >>>> + >>>> +By default Git assumes UTF-8 encoding for text files. The `encoding` >>> >>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. >> >> I am not sure I understand what you want to tell me here. >> Do you think UTF-8 encoding is not correct in the sentence above? > > Git itself was designed to handle source code in ASCII. > Text files which are encoded in ISO-8859-1, UTF-8 or any > super-set of ASCII are handled as well, and what exactly to do > with bytes >0x80 is outside the responsibility of Git. > E.g. the interpretation and rendering on the screen may be > dependent on the locale or being guessed. > All in all, saying that Git expects UTF-8 may be "overdriven". > Any kind of file that uses an '\n' as an end of line > and has no NUL bytes '\0' works as a text file in Git. > (End of BlaBla ;-) > > We can probably replace: > "By default Git assumes UTF-8 encoding for text files" > > with something like > "Git handles UTF-8 encoded files as text files, but UTF-16 encoded > files are handled as binary files" Well, UTF-32 are handled as binary file, too :-) How about this? Git recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8 or ISO-8859-1) as text files. All other... > >>> >>>> +attribute sets the encoding to be used in the working directory. >>>> +If the path is added to the index, then Git encodes the content to >>>> +UTF-8. On checkout the content is encoded back to the original >>>> +encoding. Consequently, you can use all built-in Git text processing >>>> +tools (e.g. git diff, line ending conversions, etc.) with your >>>> +non-UTF-8 encoded file. >>>> + >>>> +Please note that re-encoding content can cause errors and requires >>>> +resources. Use the `encoding` attribute only if you cannot store >>>> +a file in UTF-8 encoding and if you want Git to be able to process >>>> +the text. >>>> + >>>> +------------------------ >>>> +*.txt text encoding=UTF-16 >>>> +------------------------ >>> >>> I think that encoding (or worktree-encoding as said elsewhere) must imply text. >>> (That is not done in convert.c, but assuming binary or even auto >>> does not make much sense to me) >> >> "text" only means "LF". "-text" means CRLF which would be perfectly fine >> for UTF-16. Therefore I don't think we need to imply text. >> Does this make sense? > Yes and now. > > "text" means convert CRLF at "git add" (or commit) into LF, > And potentially LF into CRLF at checkout, > depending on the EOL attribute (if set), core.autocrlf and/or core.eol. > > "-text" means don't touch CRLF or LF at all. Files with CRLF are commited > with CRLF. > Running a Unix like "diff" tool will often show "^M" to indicate that there > is a CR before the LF, which may be distracting or confusing. > > If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system, > it makes sense to convert the line endings into LF before storing them > into the index (at least this is my experience). > > (Not specifying "text" or "-text" at all will trigger the auto-handling, > which is not needed at all). > So if we want to be sure that line endings are CRLF in the worktree we > should ask the user to specify things like this: > > *.ext worktree-encoding=UTF-16 text eol=CRLF > > It may be enough to give this example in the documentation. > and I would rather be over-careful here, to avoid future problems. Agreed. I'll add that example! >>>> + >>>> +All `iconv` encodings with a stable round-trip conversion to and from >>>> +UTF-8 are supported. You can see a full list with the following command: >>>> + >>>> +------------------------ >>>> +iconv --list >>>> +------------------------ >>>> + >>>> `eol` >>>> ^^^^^ >>>> >>>> diff --git a/convert.c b/convert.c >>>> index 20d7ab67bd..ee19c17104 100644 >>>> --- a/convert.c >>>> +++ b/convert.c >>>> @@ -7,6 +7,7 @@ >>>> #include "sigchain.h" >>>> #include "pkt-line.h" >>>> #include "sub-process.h" >>>> +#include "utf8.h" >>>> >>>> /* >>>> * convert.c - convert a file when checking it out and checking it in. >>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, >>>> >>>> } >>> >>> I would avoid to use these #ifdefs here. >>> All functions can be in strbuf.c (and may have #ifdefs there), but not here. >> >> I'll try that. But wouldn't it make more sense to move the functions to utf.c? > > May be. > Originally utf8.c was about encoding and all kind of UTF-8 related stuff. > Especially it didn't know anything about strbuf. > I don't know why strbuf.h and other functions had been added here, > > I once moved them into strbuf.c without any problems, but never send out > a patch, because of possible merge conflicts in ongoing patches. > > In any case, if it is about strbuf, I would try to put it into strbuf.c I think I simplified the code. I reused "reencode_string_len". I added just two BOM check functions to utf8.c ... I'll send out a new series in a bit. - Lars