Re: [PATCH v1] convert: add support for 'encoding' attribute

Lars Schneider <larsxschneider@xxxxxxxxx> · Fri, 29 Dec 2017 14:56:17 +0100

> On 29 Dec 2017, at 13:59, Torsten Bögershausen <tboegi@xxxxxx> wrote:
> 
> On 2017-12-28 17:14, Lars Schneider wrote:> 
>>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@xxxxxx> wrote:
>>> 
>>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@xxxxxxxxxxxx wrote:
>>>> From: Lars Schneider <larsxschneider@xxxxxxxxx>
>>>> 
>> 
>>>> +`encoding`
>>>> +^^^^^^^^^^
>>>> +
>>>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>> 
>>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
>> 
>> I am not sure I understand what you want to tell me here.
>> Do you think UTF-8 encoding is not correct in the sentence above?
> 
> Git itself was designed to handle source code in ASCII.
> Text files which are encoded in ISO-8859-1, UTF-8 or any
> super-set of ASCII are handled as well, and what exactly to do
> with bytes >0x80 is outside the responsibility of Git.
> E.g. the interpretation and rendering on the screen may be
> dependent on the locale or being guessed.
> All in all, saying that Git expects UTF-8 may be "overdriven".
> Any kind of file that uses an '\n' as an end of line
> and has no NUL bytes '\0' works as a text file in Git.
> (End of BlaBla ;-)
> 
> We can probably replace:
> "By default Git assumes UTF-8 encoding for text files"
> 
> with something like
> "Git handles UTF-8 encoded files as text files, but UTF-16 encoded
> files are handled as binary files"

Well, UTF-32 are handled as binary file, too :-)

How about this?

    Git recognizes files encoded with ASCII or one of its supersets
    (e.g. UTF-8 or ISO-8859-1) as text files.  All other...

> 
>>> 
>>>> +attribute sets the encoding to be used in the working directory.
>>>> +If the path is added to the index, then Git encodes the content to
>>>> +UTF-8.  On checkout the content is encoded back to the original
>>>> +encoding.  Consequently, you can use all built-in Git text processing
>>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>>> +non-UTF-8 encoded file.
>>>> +
>>>> +Please note that re-encoding content can cause errors and requires
>>>> +resources. Use the `encoding` attribute only if you cannot store
>>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>>> +the text.
>>>> +
>>>> +------------------------
>>>> +*.txt		text encoding=UTF-16
>>>> +------------------------
>>> 
>>> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
>>> (That is not done in convert.c, but assuming binary or even auto
>>> does not make much sense to me)
>> 
>> "text" only means "LF". "-text" means CRLF which would be perfectly fine
>> for UTF-16. Therefore I don't think we need to imply text.
>> Does this make sense?
> Yes and now.
> 
> "text" means convert CRLF at "git add" (or commit) into LF,
> And potentially LF into CRLF at checkout,
> depending on the EOL attribute (if set), core.autocrlf and/or core.eol.
> 
> "-text" means don't touch CRLF or LF at all. Files with CRLF are commited
> with CRLF.
> Running a Unix like "diff" tool will often show "^M" to indicate that there
> is a CR before the LF, which may be distracting or confusing.
> 
> If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
> it makes sense to convert the line endings into LF before storing them
> into the index (at least this is my experience).
> 
> (Not specifying "text" or "-text" at all will trigger the auto-handling,
> which is not needed at all).
> So if we want to be sure that line endings are CRLF in the worktree we
> should ask the user to specify things like this:
> 
> *.ext worktree-encoding=UTF-16 text eol=CRLF
> 
> It may be enough to give this example in the documentation.
> and I would rather be over-careful here, to avoid future problems.

Agreed. I'll add that example!

>>>> +
>>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>>> +UTF-8 are supported.  You can see a full list with the following command:
>>>> +
>>>> +------------------------
>>>> +iconv --list
>>>> +------------------------
>>>> +
>>>> `eol`
>>>> ^^^^^
>>>> 
>>>> diff --git a/convert.c b/convert.c
>>>> index 20d7ab67bd..ee19c17104 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -7,6 +7,7 @@
>>>> #include "sigchain.h"
>>>> #include "pkt-line.h"
>>>> #include "sub-process.h"
>>>> +#include "utf8.h"
>>>> 
>>>> /*
>>>> * convert.c - convert a file when checking it out and checking it in.
>>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>>>> 
>>>> }
>>> 
>>> I would avoid to use these #ifdefs here.
>>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
>> 
>> I'll try that. But wouldn't it make more sense to move the functions to utf.c?
> 
> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
> 
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
> 
> In any case, if it is about strbuf, I would try to put it into strbuf.c

I think I simplified the code. I reused "reencode_string_len".
I added just two BOM check functions to utf8.c ... I'll send out a new series
in a bit.

- Lars