On Tue, Mar 5, 2024, at 09:43, Manlio Perillo wrote:
> The term "character" is confusing: does it mean 7bit/ASCII character
> or Unicode Code Point?

IMO it should say “ASCII” in contexts where it is restricted to that.
Otherwise UTF-8 can be assumed, since git(1) handles that well.

> As an example, with
>     git config --add core.commentChar •   // Bullet (U+2022)
> git does not complain, but it is rejected later.

I think this is more about `git config --add` not doing any validation.
It just sets things. You can do `git config --add core.commentChar
'ffd'` and get the same effect.

> A counter example is using UTF-8 with "user.name", where it is handled
> correctly.

Yep. In my experience it will also handle UTF-8 in cross-system
settings: if you generate patches with git-format-patch(1) it will
handle UTF-8 that ends up in email headers correctly (those need their
own encoding). It’s quite UTF-8 friendly.

> I sent this email after reading the documentation of "git diff
> --color-moved=blocks", where the text says:
>> Blocks of moved text of at least 20 alphanumeric characters are
>> detected greedily.
>
> In this case it is not clear if the number of characters are counted
> as UTF-8 or normal 8bit bytes.

Alphanumeric characters (a-z, A-Z, and 0-9) are ASCII, and one ASCII
character is represented as one byte in UTF-8, so the two counts
coincide. This already looks precise to me.

I’ve never run into a case where git-diff(1) does not handle UTF-8. I
don’t even know if it really needs to “handle” it per se, as opposed to
just treating it as opaque bytes. Maybe it matters for things like
whitespace and word boundaries, I don’t know.

--
Kristoffer Haugsbakk
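
P.S. A minimal sketch of the no-validation point above (the repo path
/tmp/commentchar-demo is just for illustration): git-config(1) stores
whatever string you give it, and the invalid value only trips up a
command that later consumes it, as described in the thread.

```shell
# git-config(1) does not validate core.commentChar; it just sets things.
rm -rf /tmp/commentchar-demo
git init -q /tmp/commentchar-demo

# A multi-character (invalid) value is accepted silently...
git -C /tmp/commentchar-demo config core.commentChar 'ffd'

# ...and stored verbatim in .git/config:
git -C /tmp/commentchar-demo config --get core.commentChar
```

Only a command that actually uses the comment char (e.g. git-commit(1)
preparing the commit-message template) rejects it.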
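
P.P.S. The byte-counting point above can be checked directly in a
shell: an ASCII alphanumeric character occupies one byte in UTF-8,
while a non-ASCII code point such as the bullet does not.

```shell
# One ASCII character is one byte in UTF-8:
printf 'a' | wc -c    # 1

# U+2022 (•) encodes as three bytes (E2 80 A2) in UTF-8:
printf '•' | wc -c    # 3
```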