On 30.03.23 at 09:55, Diomidis Spinellis wrote:
> On 30-Mar-23 1:55, Eric Sunshine wrote:
>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>> --color-words:
>
> The built-in word separation regular expression pattern for the Perl
> language fails to work with the macOS regex engine.  The same also
> happens with the FreeBSD one (tested on 14.0).
>
> The issue can be replicated through the following sequence of commands.
>
> git init color-words
> cd color-words
> echo '*.pl diff=perl' >.gitattributes
> echo 'print 42;' >t.pl
> git add t.pl
> git commit -am Add
> git show --color-words

Or in Git's own repo:

   $ git log -p --color-words --no-merges '*.c'
   Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
   commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
   [...]

("Schwerwiegend" is the German localization of Git's "fatal".)

The error disappears when localization is turned off:

   $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
   # just finishes without an error

The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part that
the macros PATTERNS and IPATTERN in userdiff.c append is removed.  So it
seems regcomp(3) on macOS doesn't like invalid Unicode characters unless
it's in ASCII mode (LANG=C).

664d44ee7f (userdiff: simplify word-diff safeguard, 2011-01-11) explains
that this part exists to match a multi-byte UTF-8 character.  With a
regcomp(3) that supports multi-byte characters natively, they need to be
specified differently, I guess, perhaps like this: "[^\x00-\x7f]"?

> Strangely, I haven't been able to reproduce the failure with egrep on
> either of the two platforms.
>
> egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null

No idea how to specify non-ASCII bytes in shell or regex.  '\xNN' does
not seem to do the trick.  printf(1) interprets octal numbers, though:

   $ echo ö | egrep $(printf "[\200-\377]")
   egrep: illegal byte sequence

(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)

René
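
For anyone who wants to poke at this outside of Git and egrep, here is a
minimal sketch in C that calls regcomp(3) directly under a chosen
LC_CTYPE locale.  try_regcomp(), ascii_words and utf8_tail are
hypothetical names made up for this illustration; none of them exist in
userdiff.c.  Unlike the shell, a C string literal does understand '\xNN',
so the raw bytes of the appended part can be passed as they are.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): compile `pattern` as a POSIX
 * extended regex under the given LC_CTYPE locale.  Returns regcomp()'s
 * result (0 on success, a REG_* error code otherwise), or -1 if
 * setlocale() rejects the locale name.
 */
int try_regcomp(const char *locale, const char *pattern)
{
	regex_t re;
	int ret;

	assert(locale != NULL && pattern != NULL);
	if (!setlocale(LC_CTYPE, locale))
		return -1;
	ret = regcomp(&re, pattern, REG_EXTENDED);
	if (!ret)
		regfree(&re);	/* only a successfully compiled regex is freed */
	return ret;
}

/* A pure-ASCII word regex; this should compile under any locale. */
const char ascii_words[] = "[a-zA-Z_][a-zA-Z0-9_]*|[^[:space:]]";

/* The part appended by PATTERNS and IPATTERN, spelled as raw bytes. */
const char utf8_tail[] = "[\xc0-\xff][\x80-\xbf]+";
```

Comparing try_regcomp("C", utf8_tail) with the same call under a UTF-8
locale such as "en_US.UTF-8" (assuming that locale is installed) should
show whether the byte ranges are what the library objects to.  I have not
run this on macOS or FreeBSD, so what it returns there is a guess based
on the error above.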