On 06.04.23 at 22:19, René Scharfe wrote:
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep. It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character. The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.

Perhaps the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
sequences of code points U+0080 to U+10FFFF?

> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.

This is not equivalent. The original treated a sequence of non-ASCII
characters as a word. The new version treats each individual non-space
character (both ASCII and non-ASCII) as a word.

> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range. The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+". Unlike the original it matches NUL
> characters, though. Assuming that tex files usually don't contain NUL
> this should be acceptable.

This is acceptable, of course. The replacement range looks sensible.

-- Hannes
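
To make the two points above concrete, here is a minimal sketch of the difference in word granularity and of the tex character class. It is only an approximation: Python's Unicode-aware `re` module stands in for a regex engine with multi-byte support (Git itself uses regcomp()/regexec()), `\S` is used as a rough analogue of "[^[:space:]]", and the names `PER_CHAR`, `NON_ASCII_RUN`, `TEX_WORD` and `words` are made up for this illustration.

```python
import re

# Assumption: Python's `re` on str is a stand-in for a multi-byte-aware
# regex engine; this is not how Git's userdiff code is implemented.

# Analogue of "[^[:space:]]": every non-space character is its own word.
PER_CHAR = re.compile(r"\S")

# Runs of code points U+0080..U+10FFFF form one word; any other
# non-space character is a word by itself.
NON_ASCII_RUN = re.compile(r"[\u0080-\U0010ffff]+|\S")

# The proposed tex replacement: letters, digits, or anything outside
# U+0001..U+007F (which includes NUL as well as all non-ASCII characters).
TEX_WORD = re.compile(r"(?:[a-zA-Z0-9]|[^\x01-\x7f])+")


def words(pattern, text):
    """Return the list of words that `pattern` finds in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


# 'a', then U+2192 U+00DF U+00E9 (three non-ASCII characters), space, 'b'.
text = "a\u2192\u00df\u00e9 b"

print(words(PER_CHAR, text))       # five one-character words
print(words(NON_ASCII_RUN, text))  # the non-ASCII run stays together

print(bool(TEX_WORD.fullmatch("na\u00efve9")))  # non-ASCII letter is part of the word
print(bool(TEX_WORD.fullmatch("\x00")))         # NUL is matched, as noted in the patch
print(TEX_WORD.search("-"))                     # ASCII punctuation is not
```

With the per-character pattern, a word diff would mark each non-ASCII character separately, whereas the run-based pattern keeps adjacent non-ASCII characters in one word.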