On 2024-03-05 at 09:00:06, Kristoffer Haugsbakk wrote:
>
> On Tue, Mar 5, 2024, at 09:43, Manlio Perillo wrote:
> > I sent this email after reading the documentation of "git diff
> > --color-moved=blocks, where the text says:
> >> Blocks of moved text of at least 20 alphanumeric characters are detected greedily.
> >
> > In this case it is not clear if the number of characters are counted
> > as UTF-8 or normal 8bit bytes.
>
> Alphanumeric characters (a-z and A-Z and 0-9) are ASCII. And one ASCII
> char is represented using one byte in UTF-8. This already looks precise
> to me.

I don't believe that's an appropriate definition.  é is an alphanumeric
character, as is ç.  ½ is numeric.  I would argue an alphanumeric
character comprises at least Unicode classes Ll, Lm, Lo, Lt, Lu, and Nd.
Unicode TR#18 agrees with my assessment.

If we wanted to restrict it to ASCII, we need to state that explicitly.
Alternately, if the constraint is 20 UTF-8 octets or something else, we
should state that instead.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA
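
As a rough illustration of why the wording matters, here is a small Python
sketch that counts the same string under the three readings discussed above:
ASCII alphanumerics only, the Unicode general categories listed in this
message, and raw UTF-8 octets. It does not reflect what git actually
implements; the function names and the sample string are made up for the
example, and treating exactly those six categories as "alphanumeric" is an
assumption.

```python
import unicodedata

# General categories named in the message; using exactly this set as the
# definition of "alphanumeric" is an assumption for illustration only.
ALNUM_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd"}


def count_ascii_alnum(text: str) -> int:
    """Count only a-z, A-Z and 0-9."""
    return sum(1 for ch in text if ch.isascii() and ch.isalnum())


def count_unicode_alnum(text: str) -> int:
    """Count code points whose general category is in ALNUM_CATEGORIES."""
    return sum(1 for ch in text if unicodedata.category(ch) in ALNUM_CATEGORIES)


def count_utf8_octets(text: str) -> int:
    """Count the bytes of the UTF-8 encoding, alphanumeric or not."""
    return len(text.encode("utf-8"))


# Hypothetical sample: "café ½ x1", with é and ½ as single code points.
sample = "caf\u00e9 \u00bd x1"

print(count_ascii_alnum(sample))    # 5
print(count_unicode_alnum(sample))  # 6
print(count_utf8_octets(sample))    # 11
```

The three counts differ for the same input, so a threshold of "20
alphanumeric characters" is ambiguous until the documentation says which one
is meant. Note that ½ has general category No, so it is not counted by the
category set above even though it is numeric.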