Re: Should the --encoding argument to log/show commands make any guarantees about their output?

Torsten Bögershausen <tboegi@xxxxxx> · Mon, 15 Jun 2015 18:21:40 +0200

On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
> 
> I was surprised to see that the output of
> 
>     git log --encoding=utf-8 "--format=format:%b"
> 
> can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4 and the %b format specifier represents the commit message body.
> 
> I have seen this with the Linux git repository and the following test:
> 
>     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
>         'import sys; [l.decode("utf-8") for l in sys.stdin]'
> 
> Soon enough errors like this appear:
> 
>     'utf8' codec can't decode byte 0xf6 in position 19
> 
> The help message to the --encoding argument reads:
> 
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
> 
> I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.
> 
> However, I wonder what --encoding precisely does and if it has the behavior most users would expect.
> 
> Let me describe what I think it currently does:
> 
> The program attempts to re-code a log message, so it follows the chain
> 
>     raw input -> unicode -> raw output
I am not sure what "raw input/output" means here.
But there is only one re-encoding step involved, e.g.
input (ISO-8859-1) -> output (UTF-8)
When the encoding of the commit message is undefined, UTF-8 is assumed.
But Git does not verify that the message really is UTF-8.
We could guess that a message which is not valid UTF-8 is ISO-8859-1, but that is not implemented.
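
To make the single step concrete, here is a rough Python 3 model of the behaviour as described in this thread. It is not Git's code; the function name and parameters are made up for illustration.

```python
import codecs


def recode_message(raw, header_encoding=None, output_encoding="utf-8"):
    """Model of the one re-encode step discussed here (hypothetical helper)."""
    # No encoding header in the commit object: UTF-8 is assumed.
    source = header_encoding or "utf-8"

    # Same encoding on both sides: the bytes pass through unverified.
    if codecs.lookup(source).name == codecs.lookup(output_encoding).name:
        return raw

    try:
        return raw.decode(source).encode(output_encoding)
    except UnicodeError:
        # Conversion failed: the data is emitted as is.
        return raw
```

With this model, a Latin-1 message with a correct header comes out as UTF-8, while a Latin-1 message without a header is passed through untouched, which is how a stray 0xf6 byte can reach the output.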

> 
> For the first step, knowledge about the input encoding is required. 
When someone makes a commit whose message does not conform to UTF-8,
Git shows this message:
"Warning: commit message did not conform to UTF-8.\n"
"You may want to amend it after fixing the message, or set the config\n"
"variable i18n.commitencoding to the encoding your project uses.\n";

If the user ignores this warning, how should Git guess the encoding?
(Later Git versions try to do an auto-conversion assuming ISO-8859-1,
but that doesn't help existing repos.)
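
For illustration, the "if it is not UTF-8, assume ISO-8859-1" guess could look roughly like this in Python 3. This is only a sketch of the heuristic, not what Git does, and the function name is hypothetical.

```python
def decode_with_fallback(raw):
    """Decode as UTF-8 if valid, otherwise guess ISO-8859-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        # It can still be the wrong guess for other legacy encodings.
        return raw.decode("iso-8859-1")
```

The guess always produces text, but for a message written in some other 8-bit encoding it silently produces the wrong characters.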

> This is retrieved from the encoding header of the commit object if present or (from the docs)
> "lack of this header implies that the commit log message is encoded in UTF-8."
> If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec),
> the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
> 
> Is that correct?
Yes, see above.
> 
> From my point of view the most natural abstraction of a log *message* is *text*, not bytes.
> The same is true for author names.
> If I want to build a tool chain on top of log/show, this usually means that I want to work with text information.
> Hence, I want to retrieve text (a sequence of code points) from git show/log.
> Text must be transported in encoded form, sure,
> but it must not contain byte sequences that are invalid in this codec.
> Because otherwise it's just not text anymore.
> 
Call it corrupted.
> Hence, from my point of view, the rationale that git show/log should be able to output *text* information means
> that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument.
> In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
> further down the tool chain
> (at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
> 
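
As things stand, a tool on top of log/show has to do that substitution itself. A minimal Python 3 sketch, assuming the tool only needs displayable text and can live with U+FFFD in place of invalid bytes (both function names are made up for illustration):

```python
import subprocess


def to_text(raw):
    """Decode as UTF-8, replacing invalid byte sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def log_bodies(repo="."):
    """Return the commit message bodies of a repository as text."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--encoding=utf-8", "--format=format:%b"],
        stdout=subprocess.PIPE,
        check=True,
    )
    return to_text(result.stdout)
```

This never raises on bad input, but the original bytes are lost, so it is only suitable for display and not for round-tripping.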
> I am not entirely sure where this discussion should lead to. 
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
> However, I think that if the behavior of the software will not be changed,
> then the documentation for the --encoding option should be more precise and
> clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.
> 
> 
> Cheers,
> 
> 
> Jan-Philip Gehrcke
