Re: Should the --encoding argument to log/show commands make any guarantees about their output?

Jan-Philip Gehrcke <jgehrcke@xxxxxxxxxxxxxx> · Tue, 16 Jun 2015 11:38:45 +0200

On 15.06.2015 18:21, Torsten Bögershausen wrote:
On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
Let me describe what I think it currently does:

The program attempts to re-code a log message, so it follows the chain

     raw input -> unicode -> raw output
Not sure what "raw input/output" means.
But there is only one reencode step involved, e.g.
input(8859) -> output(UTF-8)

We surely agree. With "raw" I meant a sequence of bytes, and with 
"unicode" I meant the intermediate state in the process of re-encoding 
(which can be thought of as decoding and encoding with a transient 
intermediate state).
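
As a minimal illustration of that chain, here is a Python sketch. It is
not Git's actual implementation, and the function name reencode is made
up for this example:

```python
def reencode(raw: bytes, input_codec: str, output_codec: str) -> bytes:
    """Re-encode a byte sequence by decoding it and encoding it again."""
    # raw input -> unicode; raises UnicodeDecodeError on invalid input bytes
    text = raw.decode(input_codec)
    # unicode -> raw output; raises UnicodeEncodeError if a character
    # cannot be represented in the output codec
    return text.encode(output_codec)


# Example: a name stored as ISO-8859-1 bytes, re-encoded to UTF-8.
latin1_bytes = "Bögershausen".encode("iso-8859-1")
utf8_bytes = reencode(latin1_bytes, "iso-8859-1", "utf-8")
```

The intermediate text value is the transient "unicode" state; only the
byte sequences on either side ever leave the program.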

If the user ignores this warning, how should Git guess the encoding?

I entirely appreciate that there is no satisfying solution to this very 
problem.

If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec),
the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).

Is that correct?
Yes, see above.

Thanks!

Hence, from my point of view, the rationale that git show/log should be able to output *text* information means
that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument.
In the current situation, the work of dealing with invalid byte sequences is just outsourced to software
further down the tool chain
(at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
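
The fallback behavior described above can be modeled roughly as follows.
This is a Python sketch of the behavior as discussed in this thread, not
a transcription of Git's code; the helper names reencode_or_passthrough
and is_valid are hypothetical:

```python
def reencode_or_passthrough(raw: bytes, input_codec: str, output_codec: str) -> bytes:
    """Try to re-encode; if that fails, return the input bytes unchanged."""
    try:
        return raw.decode(input_codec).encode(output_codec)
    except UnicodeError:
        # Re-encoding failed: the data is dumped as is, so the requested
        # output encoding is NOT applied.
        return raw


def is_valid(data: bytes, codec: str) -> bool:
    """Report whether data decodes cleanly in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# An ISO-8859-1 byte (0xE9) inside an entry that is assumed to be UTF-8.
broken = b"caf\xe9"
out = reencode_or_passthrough(broken, "utf-8", "utf-8")
```

Here out is identical to broken, and is_valid(out, "utf-8") is False, so
the consumer receives bytes that are invalid in the codec it asked for.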

I am not entirely sure where this discussion should lead.
Yes, until someone writes a patch to improve either the documentation or the code,
nothing will be changed.
However, I think that if the behavior of the software will not be changed,
then the documentation for the --encoding option should be more precise and
clarify what actually happens behind the scenes. What do you think?
Patches are more than welcome.

I'd be willing to contribute, but of course there must be a discussion 
and an agreement before that, if there is need to change something at 
all, and what exactly.

To this discussion I would like to contribute that I am of the opinion 
that there should be a command line option to make git show/log/friends 
emit a byte stream that is guaranteed to be valid in a given codec.

That would require detection and treatment of those cases where 
corrupted text resides in the repository (we cannot prevent it from 
entering the repository, as discussed). In these cases, one could emit a 
replacement symbol (e.g. '?') per invalid byte subsequence (this seems 
to be more established than just swallowing the invalid byte sequence).
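
To make the proposal concrete, here is a small Python sketch of the
replacement approach. It only illustrates the idea using Python's
built-in "replace" error handler and says nothing about how this would
be implemented in Git; the name reencode_lossy is hypothetical:

```python
def reencode_lossy(raw: bytes, input_codec: str, output_codec: str) -> bytes:
    """Re-encode, substituting a replacement symbol for undecodable input."""
    # Invalid input bytes become U+FFFD instead of aborting the decode.
    text = raw.decode(input_codec, errors="replace")
    # Characters the output codec cannot represent (including U+FFFD
    # itself, e.g. for ASCII output) are emitted as '?'.
    return text.encode(output_codec, errors="replace")


broken = b"caf\xe9 ok"  # 0xE9 is not valid UTF-8 in this position
as_utf8 = reencode_lossy(broken, "utf-8", "utf-8")
as_ascii = reencode_lossy(broken, "utf-8", "ascii")
```

In both cases the result decodes cleanly in the requested output codec:
the UTF-8 output carries the encoded form of U+FFFD, and the ASCII
output carries a plain '?'.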

What do you think?

I think the --encoding option would have ideal semantics for the 
described behavior.

However, I guess maintaining backwards compatibility is an issue here. 
On the other hand, I realize that the --encoding option undergoes 
changes: the docs for git log in release 2.4.3 do not even list the 
--encoding option anymore. Why is that? I haven't found a corresponding 
changelog/release notes entry.

Thanks,

Jan-Philip