When given a commit message that is not valid UTF-8, Git assumes it is encoded in Latin-1 (ISO-8859-1) and silently re-encodes it from Latin-1 to UTF-8. That assumption is sensible, but the docs imply that no such conversion happens. Documentation/i18n.txt explicitly says (emphasis mine):
Although we encourage that the commit log messages are encoded in UTF-8, both the core and Git Porcelain are designed **not to force UTF-8** on projects. If all participants of a particular project find it more convenient to use legacy encodings, Git does not forbid it. However, there are a few things to keep in mind.

. 'git commit' and 'git commit-tree' issues a warning if the commit log message given to it does not look like a valid UTF-8 string, unless you explicitly say your project uses a legacy encoding. The way to say this is to have `i18n.commitEncoding` in `.git/config` file, like this:

[…]

Note that **we deliberately chose not to re-code the commit log message when a commit is made to force UTF-8** at the commit object level, because re-coding to UTF-8 is not necessarily a reversible operation.
Said warning reads:
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
I interpret this as: that config variable is only used to silence the warning (and as a courtesy to future readers), not to control some behind-the-scenes conversion, especially since the note above says that no re-coding happens. However, we can easily verify that a conversion does take place: a Latin-1 commit message produces the same commit as its UTF-8 equivalent:

---
#!/usr/bin/env bash
export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13

# symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1
git commit-tree -m "$(printf '\xc3\x84\n')" "$(git hash-object -t tree /dev/null)"
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db

git commit-tree -m "$(printf '\xc4\n')" "$(git hash-object -t tree /dev/null)"
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---

It's debatable whether Git should even attempt an encoding conversion. I think it's fine if properly communicated, so at the very least the warning message should be changed to something like
# Warning: commit message did not conform to UTF-8 and was automatically
# converted. You possibly need to fix and amend it.
# To use a different encoding than UTF-8, set the config variable
# i18n.commitEncoding to the encoding your project uses.
and the documentation adapted similarly.

regards,
paul

PS: the code responsible for the conversion is in commit.c +1654
    /* And check the encoding */
    if (encoding_is_utf8 && !verify_utf8(&buffer))
        fprintf(stderr, _(commit_utf8_warn));
and
    /*
     * This verifies that the buffer is in proper utf8 format.
     *
     * If it isn't, it assumes any non-utf8 characters are Latin1,
     * and does the conversion.
     */
    static int verify_utf8(struct strbuf *buf)
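
For illustration, the byte arithmetic behind a Latin-1 to UTF-8 re-encoding can be sketched in a few lines of shell. This is only a sketch of the general mapping, not Git's actual code: the function name latin1_byte_to_utf8_hex is hypothetical, and it handles a single byte value rather than a whole buffer. It shows why the lone byte 0xc4 from the second commit above ends up as the same two bytes, 0xc3 0x84, as in the first.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Git): print the UTF-8 encoding, in hex,
# of a single Latin-1 byte value given as a number (e.g. 0xc4 or 196).
latin1_byte_to_utf8_hex() {
    c=$(( $1 ))
    if [ "$c" -lt 128 ]; then
        # ASCII range: the byte is already valid UTF-8 and is kept as-is.
        printf '%02x\n' "$c"
    else
        # Latin-1 bytes 0x80..0xff map to code points U+0080..U+00FF, which
        # UTF-8 encodes as two bytes: 110000xx 10xxxxxx.
        printf '%02x%02x\n' $(( 0xc0 | (c >> 6) )) $(( 0x80 | (c & 0x3f) ))
    fi
}

latin1_byte_to_utf8_hex 0xc4   # prints c384, the UTF-8 form of 'Ä'
```

Because the mapping is deterministic, a message containing the raw byte 0xc4 becomes byte-identical to one that was written in UTF-8 in the first place, which is consistent with both invocations reporting the same commit hash.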