On Tue, Mar 5, 2024, at 09:43, Manlio Perillo wrote:
> The term "character" is confusing: does it mean 7bit/ASCII character
> or Unicode Code Point?

IMO it should say “ASCII” in contexts where it is restricted to that.
Otherwise UTF-8 can be assumed, since git(1) handles that well.

> As an example, with
>     git config --add core.commentChar •   // Bullet (U+2022)
> git does not complain, but it is rejected later.

I think this is more about `git config --add` not doing any validation.
It just sets things. You can do `git config --add core.commentChar
'ffd'` and get the same effect.

> A counter example is using UTF-8 with "user.name", where it is handled
> correctly.

Yep. In my experience it will also handle UTF-8 in cross-system
settings: if you generate patches with git-format-patch(1) it will
handle UTF-8 that ends up in email headers correctly (those need their
own encoding). It’s quite UTF-8 friendly.

> I sent this email after reading the documentation of "git diff
> --color-moved=blocks", where the text says:
>> Blocks of moved text of at least 20 alphanumeric characters are
>> detected greedily.
>
> In this case it is not clear if the number of characters are counted
> as UTF-8 or normal 8bit bytes.

Alphanumeric characters (a-z, A-Z, and 0-9) are ASCII, and one ASCII
character is represented as one byte in UTF-8, so the two counts
coincide. This already looks precise to me.

I’ve never run into a case where git-diff(1) does not handle UTF-8. I
don’t even know if it really needs to “handle” it per se, as opposed to
just treating it as opaque bytes. Maybe it matters for things like
whitespace and word boundaries, I don’t know.

--
Kristoffer Haugsbakk
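
P.S. A minimal sketch of the no-validation point above (the repo path
/tmp/commentchar-demo is just for illustration): git-config(1) stores
whatever string you give it, and the invalid value only trips up a
command that later consumes it, as described in the thread.

```shell
# git-config(1) does not validate core.commentChar; it just sets things.
rm -rf /tmp/commentchar-demo
git init -q /tmp/commentchar-demo

# A multi-character (invalid) value is accepted silently...
git -C /tmp/commentchar-demo config core.commentChar 'ffd'

# ...and stored verbatim in .git/config:
git -C /tmp/commentchar-demo config --get core.commentChar
```

Only a command that actually uses the comment char (e.g. git-commit(1)
preparing the commit-message template) rejects it.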
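
P.P.S. The byte-counting point above can be checked directly in a
shell: an ASCII alphanumeric character occupies one byte in UTF-8,
while a non-ASCII code point such as the bullet does not.

```shell
# One ASCII character is one byte in UTF-8:
printf 'a' | wc -c    # 1

# U+2022 (•) encodes as three bytes (E2 80 A2) in UTF-8:
printf '•' | wc -c    # 3
```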