Should the --encoding argument to log/show commands make any guarantees about their output?

Jan-Philip Gehrcke <jgehrcke@xxxxxxxxxxxxxx> · Mon, 15 Jun 2015 10:50:26 +0200

Hello,

I was surprised to see that the output of

    git log --encoding=utf-8 "--format=format:%b"

can contain byte sequences that are invalid in UTF-8. Note: I am using 
git 2.1.4 and the %b format specifier represents the commit message body.

I have seen this with the Linux git repository and the following test:

    git log --encoding=utf-8 "--format=format:%b" | python2 -c \
        'import sys; [l.decode("utf-8") for l in sys.stdin]'

Soon enough, errors like this appear:

    'utf8' codec can't decode byte 0xf6 in position 19
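
For anyone who wants to reproduce this without stopping at the first error, here is a rough Python 3 variant of the same check. It is only a sketch: the helper name find_invalid_lines and the sample bytes are made up for illustration. When piping git log into it, the stream to pass would be sys.stdin.buffer, so that the raw bytes are read rather than text.

```python
import io


def find_invalid_lines(stream, encoding="utf-8"):
    """Yield (line_number, error) for each line that fails to decode.

    `stream` must yield bytes, e.g. sys.stdin.buffer or an io.BytesIO.
    """
    for number, line in enumerate(stream, start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as exc:
            yield number, exc


# Made-up sample: the second line holds a Latin-1 "o umlaut" (0xf6),
# which is not a valid UTF-8 sequence.
sample = io.BytesIO(b"ok\nsch\xf6n\n")
invalid = [number for number, _ in find_invalid_lines(sample)]
print(invalid)
```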

The help message to the --encoding argument reads:

    The commit objects record the encoding used for the log message in
    their encoding header; this option can be used to tell the command to
    re-code the commit log message in the encoding preferred by the user.

I realize that this message makes no guarantee about the output of the 
command. In particular, it does not promise that --encoding=utf-8 
produces valid UTF-8 data in all cases.

However, I wonder what --encoding precisely does and if it has the 
behavior most users would expect.

Let me describe what I think it currently does:

The program attempts to re-code a log message, so it follows the chain

	raw input -> unicode -> raw output

For the first step, knowledge about the input encoding is required. This 
is retrieved from the encoding header of the commit object if present or 
(from the docs) "lack of this header implies that the commit log message 
is encoded in UTF-8." If this step fails (if the entry contains a byte 
sequence that is invalid in the specified/assumed input codec), the 
procedure is aborted and the data is dumped as is (obviously without 
applying the requested output encoding).
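
To make my guess concrete, here it is as a small Python sketch. This models only the behavior I described above, not git's actual implementation; the function name recode_message and its parameters are invented for illustration.

```python
def recode_message(raw, declared_encoding=None, output_encoding="utf-8"):
    """Model of the assumed behavior: re-code if possible, else pass through.

    `raw` is the log message as bytes. `declared_encoding` stands in for
    the commit's encoding header; None means the header is absent.
    """
    # A missing encoding header implies UTF-8, according to the docs.
    input_encoding = declared_encoding or "utf-8"
    try:
        text = raw.decode(input_encoding)      # raw input -> unicode
    except UnicodeDecodeError:
        # Assumed fallback: give up and emit the original bytes untouched,
        # so the requested output encoding is never applied.
        return raw
    return text.encode(output_encoding)        # unicode -> raw output


# Latin-1 bytes with a correct header are re-coded to UTF-8 ...
print(recode_message(b"sch\xf6n", "latin-1"))
# ... but without a header they are not valid UTF-8 and pass through as is.
print(recode_message(b"sch\xf6n"))
```

The second call is the case I am asking about: the caller requested UTF-8 but gets the byte 0xf6 back unchanged.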

Is that correct?

From my point of view the most natural abstraction of a log *message* 
is *text*, not bytes. The same is true for author names. If I want to 
build a tool chain on top of log/show, this usually means that I want to 
work with text information. Hence, I want to retrieve text (a sequence 
of code points) from git show/log. Text must be transported in encoded 
form, sure, but it must not contain byte sequences that are invalid in 
this codec. Because otherwise it's just not text anymore.

Hence, from my point of view, if git show/log are meant to output 
*text*, they should not emit byte sequences that are invalid in the 
codec specified via the --encoding argument. In the current situation, 
the work of dealing with invalid byte sequences is just outsourced to 
software further down the tool chain (at some point a replacement 
character � should be displayed to the user instead of the invalid raw 
bytes).
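
For illustration, this is roughly what every consumer of the output has to do today. It is a sketch in Python 3, and the helper name to_text is hypothetical; it relies only on the standard errors="replace" mode of bytes.decode.

```python
def to_text(raw_line, encoding="utf-8"):
    """Decode bytes to text, mapping invalid sequences to U+FFFD."""
    return raw_line.decode(encoding, errors="replace")


# The stray Latin-1 byte 0xf6 becomes a single replacement character.
line = to_text(b"sch\xf6n")
print(line == "sch\ufffdn")
```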

I am not entirely sure where this discussion should lead. However, I 
think that if the behavior of the software is not going to change, then 
the documentation for the --encoding option should be more precise and 
clarify what actually happens behind the scenes. What do you think?

Cheers,

Jan-Philip Gehrcke
