(please no top-posting, see my reply at the end)

On Mon, Jun 19, 2023 at 03:22:27PM +0200, paul wrote:
> When supplying a commit message that is not UTF-8, Git assumes it is encoded in latin1/ISO-8859-1, and silently does a latin1 -> UTF-8 re-encoding. That assumption is sensible, but the problem is the docs imply such a conversion does not happen.
>
> Documentation/i18n.txt explicitly says (emphasis mine):
> > Although we encourage that the commit log messages are encoded
> > in UTF-8, both the core and Git Porcelain are designed **not to
> > force UTF-8** on projects. If all participants of a particular
> > project find it more convenient to use legacy encodings, Git
> > does not forbid it. However, there are a few things to keep in
> > mind.
> > . 'git commit' and 'git commit-tree' issues
> > a warning if the commit log message given to it does not look
> > like a valid UTF-8 string, unless you explicitly say your
> > project uses a legacy encoding. The way to say this is to
> > have `i18n.commitEncoding` in `.git/config` file, like this:
> > […]
> > Note that **we deliberately chose not to re-code the commit log
> > message when a commit is made to force UTF-8** at the commit
> > object level, because re-coding to UTF-8 is not necessarily a
> > reversible operation.
>
> Said warning reads:
> > # Warning: commit message did not conform to UTF-8.
> > # You may want to amend it after fixing the message, or set the config
> > # variable i18n.commitEncoding to the encoding your project uses.
>
> I interpret this as: that config variable is only used to silence the warning (and as courtesy to the future reader), not to control some behind-the-scenes conversion, especially because the note above says a re-coding doesn't happen.
>
> But we can easily verify that a non-UTF-8 commit message produces the same commit:
> ---
> #!/usr/bin/env bash
>
> export {GIT_AUTHOR_NAME,GIT_AUTHOR_EMAIL,GIT_COMMITTER_EMAIL,GIT_COMMITTER_NAME}=me
> export {GIT_AUTHOR_DATE,GIT_COMMITTER_DATE}=2005-04-07T22:13:13
> # symbol 'Ä' is 0xc384 in utf-8 but 0xc4 in iso-8859-1
>
> git commit-tree -m $(printf '\xc3\x84\n') $(git hash-object -t tree /dev/null)
> # 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
>
> git commit-tree -m $(printf '\xc4\n') $(git hash-object -t tree /dev/null)
> # Warning: commit message did not conform to UTF-8.
> # You may want to amend it after fixing the message, or set the config
> # variable i18n.commitEncoding to the encoding your project uses.
> # 4df6ed26ffd0b02c8b5c1d85ba184b02fbcaa2db
> ---
>
> It's debatable whether Git should even attempt to do an encoding conversion – I think it's fine if properly communicated, so at the very least the warning message should be changed to something like
>
> > # Warning: commit message did not conform to UTF-8 and was automatically
> > # converted. You possibly need to fix and amend it.
> > # To use a different encoding than UTF-8, set the config variable
> > # i18n.commitEncoding to the encoding your project uses.
>
> and the documentation adapted similarly.
>
> regards,
> paul
>
>
> PS: the code responsible for the conversion is in
> commit.c +1654
> > /* And check the encoding */
> > if (encoding_is_utf8 && !verify_utf8(&buffer))
> > fprintf(stderr, _(commit_utf8_warn));
> and
> > /*
> > * This verifies that the buffer is in proper utf8 format.
> > *
> > * If it isn't, it assumes any non-utf8 characters are Latin1,
> > * and does the conversion.
> > */
> > static int verify_utf8(struct strbuf *buf)

Thanks for the well-written report.
I haven't had time to dig into the details in the same depth as you did,
so I can't yet say what the best solution is.
Fixing the documentation to reflect reality is always good.
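
To make the behavior in the report concrete, here is a minimal Python sketch of a fallback like the one the quoted verify_utf8() comment describes: bytes that form valid UTF-8 pass through unchanged, and any byte that does not is read as Latin-1 and re-encoded as UTF-8. I have not checked this against commit.c line by line, so treat it as an illustration under that assumption, not as Git's actual logic. The names latin1_fallback and recode_message are made up for the example.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: interpret undecodable bytes as Latin-1 (assumed behavior)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Every byte 0x00-0xFF maps to exactly one code point in Latin-1.
    return bad.decode("latin-1"), err.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def recode_message(raw):
    """Return (utf8_bytes, was_valid_utf8) for a raw commit message."""
    text = raw.decode("utf-8", errors="latin1_fallback")
    fixed = text.encode("utf-8")
    # If nothing had to be replaced, the bytes round-trip unchanged.
    return fixed, fixed == raw


if __name__ == "__main__":
    # The two messages from the script in the report.
    print(recode_message(b"\xc3\x84\n"))  # 'Ä' already encoded as UTF-8
    print(recode_message(b"\xc4\n"))      # 'Ä' encoded as ISO-8859-1
```

Under this sketch both inputs end up as the same bytes (0xc3 0x84 followed by a newline), which would explain why the two commit-tree calls in the report print the same hash. The second return value is False only for the Latin-1 input, which is the case where the warning is shown.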

In the beginning, Git was more or less unaware of encodings.
Then Windows came along, using local code pages, and the Linux distros
switched to using UTF-8 everywhere.

Having said that: do you want to send a patch (or two) to improve the situation?

And out of interest: how did you find this problem, and on which OS/system?