Re: [PATCH] userdiff: support regexec(3) with multi-byte support

On 07.04.23 at 00:35, Johannes Sixt wrote:
> On 06.04.23 at 22:19, René Scharfe wrote:
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep.  It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>>
>> This broke all built-in userdiff word regexes in UTF-8 locales because
>> they all include such invalid bytes in expressions that are intended to
>> match multi-byte characters without explicit support for that from the
>> regex engine.
>>
>> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
>> regexes to match a single non-space or multi-byte character.  The \xNN
>> characters are invalid if interpreted as UTF-8 because they have their
>> high bit set, which indicates they are part of a multi-byte character,
>> but they are surrounded by single-byte characters.
>
> Perhaps the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
> sequences of code points U+0080 to U+10FFFF?

regcomp(3) on macOS doesn't like it:

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[Ā-????]

Looks like it objects to U+10FFFF here; "[\xc4\x80-\xf3\xa0\x80\x80]" is
accepted for example.

\xc4\x80 is U+0100, by the way; U+0080 would be \xc2\x80.  And
regcomp(3) doesn't like that either ("[\xc2\x80-\xf3\xa0\x80\x80]"):

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<U+0080>-󠀀]
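As a rough cross-check of the byte sequences discussed above, here is a
minimal Python sketch.  It is not part of the patch and does not exercise
regcomp(3) at all; the helper names (utf8, code_point, is_valid_utf8) are
made up for illustration.  It only relies on Python's strict UTF-8 codec.

```python
def utf8(cp):
    """Return the UTF-8 encoding of a single code point."""
    return chr(cp).encode("utf-8")

def code_point(raw):
    """Return the code point encoded by a single UTF-8 sequence."""
    return ord(raw.decode("utf-8"))

def is_valid_utf8(raw):
    """Report whether Python's strict decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(utf8(0x80))                               # b'\xc2\x80'
print(utf8(0x100))                              # b'\xc4\x80'
print(utf8(0x10FFFF))                           # b'\xf4\x8f\xbf\xbf'
print(hex(code_point(b"\xf3\xa0\x80\x80")))     # 0xe0000
print(is_valid_utf8(b"\xf7\xbf\xbf\xbf"))       # False
```

According to this, U+10FFFF is \xf4\x8f\xbf\xbf, while \xf7\xbf\xbf\xbf
is not a valid UTF-8 sequence at all (it would decode to 0x1FFFFF, which
is outside the Unicode range).  The accepted upper bound \xf3\xa0\x80\x80
is U+E0000.  Whether that explains the regcomp(3) rejection is not
something this sketch can show.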

>> Replace that expression with "|[^[:space:]]" if the regex engine
>> supports multi-byte matching, as there is no need to have an explicit
>> range for multi-byte characters then.
>
> This is not equivalent. The original treated a sequence of non-ASCII
> characters as a word. The new version treats each individual non-space
> character (both ASCII and non-ASCII) as a word.

I assume you mean "The original treated [a single non-space as well as]
a sequence of non-ASCII characters [making up a single multi-byte
character] as a word.".  That works as intended by 664d44ee7f (userdiff:
simplify word-diff safeguard, 2011-01-11).

The new one doesn't match multi-byte whitespace anymore.  What other
differences do they have?
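
To illustrate the whitespace difference, here is a small Python sketch
that approximates both regex tails.  It is an assumption-laden model, not
git's code: Python's re module stands in for regexec(3), the
multi-byte-aware "[^[:space:]]" is modelled as \S on a str, and the
names OLD_TAIL, NEW_TAIL, old_words and new_words are hypothetical.
Python tries alternatives left to right rather than POSIX
leftmost-longest, so the multi-byte branch is listed first.

```python
import re

# Old tail, matched byte-wise: a lead byte followed by continuation
# bytes, or any single byte that is not ASCII whitespace.
OLD_TAIL = re.compile(rb"[\xc0-\xff][\x80-\xbf]+|[^ \t\n\r\f\v]")

# New tail on a multi-byte-aware engine: any single non-space character.
NEW_TAIL = re.compile(r"\S")

def old_words(text):
    """Tokens found by the byte-oriented expression."""
    return OLD_TAIL.findall(text.encode("utf-8"))

def new_words(text):
    """Tokens found by the character-oriented expression."""
    return NEW_TAIL.findall(text)

sample = "a\u3000\u00e9"   # 'a', IDEOGRAPHIC SPACE (U+3000), 'é'
print(old_words(sample))   # [b'a', b'\xe3\x80\x80', b'\xc3\xa9']
print(new_words(sample))   # ['a', 'é']
```

In this model the byte-wise expression treats U+3000 as a word, while
the character-wise one skips it as whitespace.  For non-space characters
both yield one token per character.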

René



