commit messages are silently re-encoded to UTF-8 despite docs implying otherwise

When given a commit message that is not valid UTF-8, Git assumes it is encoded in latin1 (ISO-8859-1) and silently re-encodes it from latin1 to UTF-8. The assumption itself is sensible; the problem is that the docs imply no such conversion happens.

Documentation/i18n.txt explicitly says (emphasis mine):
Although we encourage that the commit log messages are encoded
in UTF-8, both the core and Git Porcelain are designed **not to
force UTF-8** on projects.  If all participants of a particular
project find it more convenient to use legacy encodings, Git
does not forbid it.  However, there are a few things to keep in
mind.
. 'git commit' and 'git commit-tree' issues
  a warning if the commit log message given to it does not look
  like a valid UTF-8 string, unless you explicitly say your
  project uses a legacy encoding.  The way to say this is to
  have `i18n.commitEncoding` in `.git/config` file, like this:
[…]
Note that **we deliberately chose not to re-code the commit log
message when a commit is made to force UTF-8** at the commit
object level, because re-coding to UTF-8 is not necessarily a
reversible operation.

Said warning reads:
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.

I interpret this as: that config variable is only used to silence the warning (and as a courtesy to the future reader), not to control some behind-the-scenes conversion, especially because the note above says that no re-coding happens.

But we can easily verify that a commit message given in latin1 produces exactly the same commit as the same message given in UTF-8, which means it was re-encoded:
---
#!/usr/bin/env bash

export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
# 'Ä' is 0xc3 0x84 in UTF-8 but 0xc4 in ISO-8859-1

# message given as UTF-8
git commit-tree -m "$(printf '\xc3\x84\n')" "$(git hash-object -t tree /dev/null)"
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db

# message given as ISO-8859-1
git commit-tree -m "$(printf '\xc4\n')" "$(git hash-object -t tree /dev/null)"
# Warning: commit message did not conform to UTF-8.
# You may want to amend it after fixing the message, or set the config
# variable i18n.commitEncoding to the encoding your project uses.
# 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
---
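
The identical hashes are consistent with the 0xc4 byte having been re-encoded before the commit object was written. As a rough cross-check outside Git, the sketch below converts that single byte from ISO-8859-1 to UTF-8 and prints the result in hex. It assumes `iconv`, `od` and `tr` are available; the helper name `latin1_to_utf8_hex` is made up for this illustration and is not part of Git.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Git): read ISO-8859-1 bytes on stdin,
# re-encode them as UTF-8, and print the resulting bytes as one hex string.
latin1_to_utf8_hex() {
    iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# 0xc4 is 'Ä' in ISO-8859-1; it is written in octal (\304) because
# octal escapes are portable across printf implementations.
printf '\304' | latin1_to_utf8_hex
echo
```

This should print c384, i.e. the same two bytes that were passed explicitly in the first commit-tree call.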

It's debatable whether Git should even attempt to do an encoding conversion – I think it's fine if properly communicated, so at the very least the warning message should be changed to something like

# Warning: commit message did not conform to UTF-8 and was automatically
# converted. You possibly need to fix and amend it.
# To use a different encoding than UTF-8, set the config variable
# i18n.commitEncoding to the encoding your project uses.

and the documentation adapted similarly.

regards,
paul


PS: the code responsible for the conversion is in
commit.c +1654
/* And check the encoding */
if (encoding_is_utf8 && !verify_utf8(&buffer))
    fprintf(stderr, _(commit_utf8_warn));

and
/*
 * This verifies that the buffer is in proper utf8 format.
 *
 * If it isn't, it assumes any non-utf8 characters are Latin1,
 * and does the conversion.
 */
static int verify_utf8(struct strbuf *buf)
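
The behaviour that comment describes can be approximated in a few lines of shell. This is only a sketch under stated assumptions, not Git's implementation: the function name `to_utf8_assuming_latin1` is hypothetical, `iconv` and `mktemp` are assumed to be installed, and the whole message is re-encoded when validation fails, whereas the comment above says Git converts only the non-UTF-8 characters.

```shell
#!/usr/bin/env bash
# Hypothetical approximation (not Git's code): pass valid UTF-8 through
# unchanged, otherwise assume the input is Latin-1 and convert it.
# Simplification: the WHOLE message is converted on failure.
to_utf8_assuming_latin1() {
    local msg_file
    msg_file=$(mktemp) || return 1
    cat > "$msg_file"
    if iconv -f UTF-8 -t UTF-8 "$msg_file" > /dev/null 2>&1; then
        cat "$msg_file"                            # already valid UTF-8
    else
        iconv -f ISO-8859-1 -t UTF-8 "$msg_file"   # assume Latin-1
    fi
    rm -f "$msg_file"
}

# The two messages from the script above, with octal escapes for portability
printf '\304\n'     | to_utf8_assuming_latin1 | od -An -tx1   # Latin-1 input
printf '\303\204\n' | to_utf8_assuming_latin1 | od -An -tx1   # UTF-8 input
```

Both invocations should print the same bytes (c3 84 0a), which mirrors why the two commit-tree calls end up with the same hash.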


