On 2015-06-15 10.50, Jan-Philip Gehrcke wrote:
> Hello,
>
> I was surprised to see that the output of
>
>     git log --encoding=utf-8 "--format=format:%b"
>
> can contain byte sequences that are invalid in UTF-8. Note: I am using
> git 2.1.4 and the %b format specifier represents the commit message body.
>
> I have seen this with the Linux git repository and the following test:
>
>     git log --encoding=utf-8 "--format=format:%b" | python2 -c \
>         'import sys; [l.decode("utf-8") for l in sys.stdin]'
>
> Soon enough errors like this appear:
>
>     'utf8' codec can't decode byte 0xf6 in position 19
>
> The help message to the --encoding argument reads:
>
>> The commit objects record the encoding used for the log message in
>> their encoding header; this option can be used to tell the command to
>> re-code the commit log message in the encoding preferred by the user
>
> I realize that this message does not give any guarantee about the output
> of the command, in the sense that --encoding=utf-8 produces valid UTF-8
> data in all cases.
>
> However, I wonder what --encoding precisely does and if it has the
> behavior most users would expect.
>
> Let me describe what I think it currently does:
>
> The program attempts to re-code a log message, so it follows the chain
>
>     raw input -> unicode -> raw output

Not sure what "raw input/output" means.
But there is only one re-encoding step involved, e.g.

    input(8859) -> output(UTF-8)

When the encoding of the commit message is undefined, UTF-8 is assumed.
But Git does not verify that the message really is UTF-8.
We could guess that if it is not UTF-8 then it is ISO-8859-1,
but that is not implemented.

> For the first step, knowledge about the input encoding is required.
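
A minimal sketch of the single re-encoding step as it is described in this thread, written in Python purely for illustration. It models the observable behavior only and is not Git's actual code; the function name `reencode_model` and its parameters are hypothetical.

```python
from typing import Optional


def reencode_model(raw: bytes,
                   header_encoding: Optional[str],
                   output_encoding: str = "utf-8") -> bytes:
    """Model of the behavior discussed here (hypothetical helper, not Git code)."""
    # A missing encoding header means the message is assumed to be UTF-8.
    source_encoding = header_encoding or "utf-8"
    try:
        # One conversion: source encoding -> requested output encoding.
        return raw.decode(source_encoding).encode(output_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError, LookupError):
        # Conversion failed: the bytes are passed through unchanged,
        # so the result may be invalid in the requested output encoding.
        return raw
```

With a correct header, `reencode_model(b"sch\xf6n", "iso-8859-1")` yields valid UTF-8. Without a header, the same bytes come back untouched, which is how a byte like 0xf6 can end up in output requested as UTF-8.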

When someone makes a commit whose message does not conform to UTF-8,
Git shows this warning:

    Warning: commit message did not conform to UTF-8.
    You may want to amend it after fixing the message, or set the config
    variable i18n.commitencoding to the encoding your project uses.

If the user ignores this warning, how should Git guess the encoding?
(Later Git versions try to do an auto-conversion assuming ISO-8859-1,
but that doesn't help real existing repos.)

> This is retrieved from the encoding header of the commit object if
> present or (from the docs)
> "lack of this header implies that the commit log message is encoded in UTF-8."
> If this step fails (if the entry contains a byte sequence that is invalid
> in the specified/assumed input codec), the procedure is aborted and the
> data is dumped as is (obviously without applying the requested output
> encoding).
>
> Is that correct?

Yes, see above.

> From my point of view the most natural abstraction of a log *message* is
> *text*, not bytes. The same is true for author names.
> If I want to build a tool chain on top of log/show, this usually means
> that I want to work with text information.
> Hence, I want to retrieve text (a sequence of code points) from git show/log.
> Text must be transported in encoded form, sure, but it must not contain
> byte sequences that are invalid in this codec.
> Because otherwise it's just not text anymore. Call it corrupted.
>
> Hence, from my point of view, the rationale that git show/log should be
> able to output *text* information means that they should not emit byte
> sequences that are invalid in the codec specified via the --encoding
> argument.
> In the current situation, the work of dealing with invalid byte sequences
> is just outsourced to software further below in the tool chain
> (at some point a replacement character � should be displayed to the user
> instead of the invalid raw bytes).
>
> I am not entirely sure where this discussion should lead to.

Yes, until someone writes a patch to improve either the documentation
or the code, nothing will change.

> However, I think that if the behavior of the software will not be changed,
> then the documentation for the --encoding option should be more precise and
> clarify what actually happens behind the scenes. What do you think?

Patches are more than welcome.

> Cheers,
>
> Jan-Philip Gehrcke
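
Until the code or the documentation changes, a tool that consumes `git log` output probably has to deal with invalid bytes itself. A rough sketch in Python 3, assuming the output is read as bytes (for example from `sys.stdin.buffer`); the helper names `decode_tolerant` and `invalid_line_numbers` are hypothetical.

```python
from typing import Iterable, List


def decode_tolerant(data: bytes, encoding: str = "utf-8") -> str:
    """Decode log output, substituting U+FFFD for undecodable bytes."""
    return data.decode(encoding, errors="replace")


def invalid_line_numbers(lines: Iterable[bytes],
                         encoding: str = "utf-8") -> List[int]:
    """Return the 1-based numbers of lines that fail strict decoding."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            line.decode(encoding)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            bad.append(number)
    return bad
```

Used as a filter, `invalid_line_numbers(sys.stdin.buffer)` would report which lines of the piped log are affected, and `decode_tolerant` gives text that is always safe to display, at the cost of losing the original bytes.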