Re: Clarify the meaning of "character" in the documentation

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 05 Mar 2024 07:32:38 -0800

"Kristoffer Haugsbakk" <code@xxxxxxxxxxxxxxx> writes:

>> As an example, with
>> git config --add core.commentChar •  // Bullet (U+2022)
>> git does not complain, but it is rejected later.
>
> I think this is more about `git config --add` not doing any
> validation. It just sets things. You can do `git config --add
> core.commentChar 'ffd'` and get the same effect.

This is not wrong per se, but it merely explains why "config" takes
it without complaining (the command just does not know anything
about what each variable means and what the valid range of values
is).  core.commentChar is limited to "a byte", so in the context of
everything else (like the commit log message in the editor) being
UTF-8, it means only ASCII would work there.
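
To illustrate the byte-vs-character point, here is a small Python
sketch.  It is not how Git checks the value; fits_in_one_byte() is a
made-up helper that only asks whether a string is exactly one byte
long once encoded as UTF-8.

```python
def fits_in_one_byte(value: str) -> bool:
    """Hypothetical check: is `value` exactly one byte in UTF-8?"""
    return len(value.encode("utf-8")) == 1


for candidate in ["#", ";", "\u2022", "//"]:
    encoded = candidate.encode("utf-8")
    # Show the raw bytes next to the verdict for each candidate.
    print(repr(candidate), encoded.hex(" "), fits_in_one_byte(candidate))
```

The bullet (U+2022) encodes to the three bytes e2 80 a2, so it cannot
fit in a single-byte setting, and "//" is two ASCII bytes.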

As you said, we should document core.commentChar as limited to an
ASCII character, at least as a short-term solution.

I personally do not see a reason, however, why we need to be limited
to a single byte.  If a patch cleanly allows us to use any sequence
of one or more bytes as core.commentChar, I do not offhand see a good
reason to reject it---it would be fully backward compatible and would
allow you to use a UTF-8 character outside ASCII, as well as "//" and
the like.

> Alphanumeric characters (a-z and A-Z and 0-9) are ASCII. And one ASCII
> char is represented using one byte in UTF-8. This already looks precise
> to me.

Correct.

> I’ve never run into a case where git-diff(1) does not handle UTF-8. I
> don’t even know if it really needs to “handle” it per se as opposed to
> just treating it as opaque bytes. Maybe it matters for things like
> whitespace and word-boundaries, I don’t know.

The core part of "diff" is very much line oriented, and after
chopping your random sequence of bytes at each LF that appears in
it, the code is pretty oblivious to character boundaries, except
for a few cases.  "-w" needs to know what the whitespace characters
are (it knows only the limited basic set like SP, HT and probably
VT), and "-i" needs to know that "A" and "a" are equivalent (I think
it only knows ASCII, but I may be misremembering).  Outside the
core part of "diff", there are frills that need to know about
character boundaries, like chopping the function header comment
placed on a hunk header "@@ -1682,7 +1682,7 @@" to a reasonable
length, and --color-words/--word-diff, which first separate lines
into multi-character tokens and then align matching sequences in
them, etc.
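
As a rough illustration of why splitting at LF is safe while chopping
to a length is not, here is a Python sketch.  It is not Git's code
(which is in C); split_lines() and truncate_utf8() are hypothetical
helpers, and the latter assumes its input is valid UTF-8.

```python
def split_lines(data: bytes) -> list:
    # The LF byte (0x0a) never occurs inside a multi-byte UTF-8
    # sequence, so splitting on it cannot cut a character in half.
    return data.split(b"\n")


def truncate_utf8(line: bytes, max_bytes: int) -> bytes:
    """Cut `line` to at most `max_bytes` without splitting a character."""
    if len(line) <= max_bytes:
        return line
    cut = max_bytes
    # Continuation bytes look like 10xxxxxx; back up until the cut
    # lands on the first byte of a character.
    while cut > 0 and (line[cut] & 0xC0) == 0x80:
        cut -= 1
    return line[:cut]


lines = split_lines("a\u2022b\nna\u00efve".encode("utf-8"))
print(lines)
print(truncate_utf8(lines[1], 3))  # a naive lines[1][:3] would split the i-diaeresis
```

A plain byte slice at offset 3 of the second line would end in the
middle of the two-byte character, which is the kind of thing the
hunk-header chopping has to avoid.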