Re: Clarify the meaning of "character" in the documentation

On 2024-03-05 at 09:00:06, Kristoffer Haugsbakk wrote:
> 
> On Tue, Mar 5, 2024, at 09:43, Manlio Perillo wrote:
> > I sent this email after reading the documentation of "git diff
> > --color-moved=blocks", where the text says:
> >> Blocks of moved text of at least 20 alphanumeric characters are detected greedily.
> >
> > In this case it is not clear whether the number of characters is
> > counted in UTF-8 characters or in plain 8-bit bytes.
> 
> Alphanumeric characters (a-z and A-Z and 0-9) are ASCII. And one ASCII
> char is represented using one byte in UTF-8. This already looks precise
> to me.

I don't believe that's an appropriate definition. é is an alphanumeric
character, as is ç.  ½ is numeric.  I would argue an alphanumeric
character comprises at least Unicode classes Ll, Lm, Lo, Lt, Lu, and Nd.
Unicode TR#18 agrees with my assessment.

If we want to restrict it to ASCII, we need to state that explicitly.
Alternatively, if the constraint is 20 UTF-8 octets or something else, we
should state that instead.
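
To illustrate how far the possible readings diverge, here is a minimal
Python sketch. It is not how Git counts anything; the helper names
(is_alnum_ascii, is_alnum_unicode, count_summary) are hypothetical, and
the category set is just the minimal one listed above.

```python
import unicodedata

# Assumption: the minimal set of general categories named in this mail.
ALNUM_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd"}


def is_alnum_ascii(ch: str) -> bool:
    """True only for a-z, A-Z and 0-9."""
    return ch.isascii() and ch.isalnum()


def is_alnum_unicode(ch: str) -> bool:
    """True if the code point is in one of the categories listed above."""
    return unicodedata.category(ch) in ALNUM_CATEGORIES


def count_summary(text: str) -> dict:
    """Count the same text under four different readings of "character"."""
    return {
        "code_points": len(text),
        "utf8_octets": len(text.encode("utf-8")),
        "ascii_alnum": sum(is_alnum_ascii(c) for c in text),
        "unicode_alnum": sum(is_alnum_unicode(c) for c in text),
    }


if __name__ == "__main__":
    # U+00E9 (é) and U+00E7 (ç) are category Ll; U+00BD (½) is category No,
    # so it is numeric but falls outside the minimal set used here.
    for sample in ("abc123", "\u00e9", "\u00e7", "\u00bd"):
        print(repr(sample), count_summary(sample))
```

For pure ASCII input all four counts agree, which is why the ambiguity
is easy to miss; they only diverge once non-ASCII text is involved.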
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


