"Kristoffer Haugsbakk" <code@xxxxxxxxxxxxxxx> writes: >> As an example, with >> git config --add core.commentChar • // Bullet (U+2022) >> git does not complain, but it is rejected later. > > I think this is more about `git config --add` not doing any > validation. It just sets things. You can do `git config --add > core.commentChar 'ffd'` and get the same effect. This is not wrong per-se, but it merely explains why "config" takes it without complaining (the command just does not know anything about what each variable means and what the valid range of values are). core.commentChar is limited to "a byte" so in the context of everything else (like commit log message in the editor) being UTF-8, it means ASCII would only work there. As you said, we should document core.commentChar as limited to an ASCII character, at least as a short term solution. I personally do not see a reason, however, why we need to be limited to a single byte, though. If a patch cleanly implements to allow us to use any one-or-more-byte sequence as core.commentChar, I do not offhand see a good reason to reject it---it would be fully backward compatible and allows you to use a UTF-8 charcter outside ASCII, as well as "//" and the like. > Alphanumeric characters (a-z and A-Z and 0-9) are ASCII. And one ASCII > char is represented using one byte in UTF-8. This already looks precise > to me. Correct. > I’ve never run into a case where git-diff(1) does not handle UTF-8. I > don’t even know if it really needs to “handle” it per se as opposed to > just treating it as opaque bytes. Maybe it matters for things like > whitespace and word-boundaries, I don’t know. The core part of "diff" is very much line oriented, and after chopping your random sequence of bytes at each LF that appears in it, the code is pretty oblivious to the character boundary, except for a few cases. "-w" needs to know what the whitespace characters are (it knows only the limited basic set like SP HT and probably VT), "-i" needs to know that "A" and "a" are equivalent (I think it only knows the ASCII, but I may be misremembering). Outside the core part of "diff", there are frills that need to know about character boundaries, like chopping the function header comment placed on a hunk header "@@ -1682,7 +1682,7 @@" to a reasonable length, --color-words/--word-diff that first separates lines into multi-character tokens and align matching sequences in them, etc.