Re: [PATCH v2 7/9] grep/pcre: support utf-8

Plamen Totev <plamen.totev@xxxxxx> · Sat, 11 Jul 2015 11:07:25 +0300 (EEST)

Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:
> In the previous change in this function, we add locale support for 
> single-byte encodings only. It looks like pcre only supports utf-* as 
> multibyte encodings, the others are left in the cold (which is 
> fine). We need to enable PCRE_UTF8 so pcre can parse the string 
> correctly before folding case. 

> if (opt->ignore_case) { 
> p->pcre_tables = pcre_maketables(); 
> +	if (is_utf8_locale()) 
> +	options |= PCRE_UTF8; 
> options |= PCRE_CASELESS; 
> } 

We need to set the PCRE_UTF8 flag whenever the locale is UTF-8, not only
when the search is case insensitive. Otherwise pcre treats the pattern and
the subject as single-byte encoded, and a regex that contains quantifiers
will not work as expected: the quantifier applies only to the last byte of
a multi-byte character instead of to the whole character.
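To make the suggestion concrete, here is a minimal sketch of the option logic I have in mind. The helper name and the SKETCH_* flag values are made up for illustration; in grep.c the real PCRE_UTF8 and PCRE_CASELESS constants from <pcre.h> and the existing is_utf8_locale() check would be used instead.

```c
#include <assert.h>

/*
 * Stand-in flag bits for illustration only (assumption): real code
 * would use PCRE_CASELESS and PCRE_UTF8 from <pcre.h>.
 */
#define SKETCH_CASELESS 0x1
#define SKETCH_UTF8     0x2

/*
 * Hypothetical helper: compute the compile options so that the UTF-8
 * flag depends only on the locale, not on case-insensitivity.
 */
static int sketch_pcre_options(int ignore_case, int utf8_locale)
{
	int options = 0;

	if (utf8_locale)
		options |= SKETCH_UTF8;     /* set for every UTF-8 search */
	if (ignore_case)
		options |= SKETCH_CASELESS; /* independent of the UTF-8 flag */
	return options;
}
```

The only point of the sketch is that the UTF-8 check moves out of the ignore_case branch; everything else in the function would stay as in the patch.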

For example, take a file that contains the string

TILRAUN: HALLÓÓÓ HEIMUR!

the following command

git grep -P "HALLÓ{3}"

will not match the file while 

git grep -P "HAL{2}ÓÓÓ"

will. That's because L is encoded as a single byte, while Ó takes two
bytes in UTF-8.
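The effect can be reproduced without pcre. The sketch below is not how pcre works internally; it only expands a pattern of the form head + unit{count} + tail into the literal byte string it denotes, once with the quantifier bound to the last byte of the unit (byte mode) and once bound to the whole unit (UTF-8 mode), and then searches for it with strstr(). All names are hypothetical, and Ó is written as its UTF-8 bytes 0xC3 0x93.

```c
#include <assert.h>
#include <string.h>

/* The sample line from above, with each Ó spelled as UTF-8 bytes. */
static const char sample_line[] =
	"TILRAUN: HALL\xC3\x93\xC3\x93\xC3\x93 HEIMUR!";

/* Append n bytes of src at out[len], truncating to fit; outsz must be >= 1. */
static size_t append(char *out, size_t outsz, size_t len,
		     const char *src, size_t n)
{
	if (len + n + 1 > outsz)
		n = outsz > len + 1 ? outsz - len - 1 : 0;
	memcpy(out + len, src, n);
	out[len + n] = '\0';
	return len + n;
}

/*
 * Return 1 if the literal denoted by head + unit{count} + tail occurs
 * in text. In byte mode only the last byte of unit is repeated; in
 * UTF-8 mode the whole unit is repeated. unit must be non-empty.
 */
static int literal_matches(const char *text, const char *head,
			   const char *unit, int count,
			   const char *tail, int utf8_mode)
{
	char buf[256];
	size_t ulen = strlen(unit);
	size_t fixed = utf8_mode ? 0 : ulen - 1; /* bytes not repeated */
	size_t len = 0;
	int i;

	len = append(buf, sizeof(buf), len, head, strlen(head));
	len = append(buf, sizeof(buf), len, unit, fixed);
	for (i = 0; i < count; i++)
		len = append(buf, sizeof(buf), len, unit + fixed, ulen - fixed);
	append(buf, sizeof(buf), len, tail, strlen(tail));
	return strstr(text, buf) != NULL;
}
```

With this, literal_matches(sample_line, "HALL", "\xC3\x93", 3, "", 0) looks for HALL followed by 0xC3 0x93 0x93 0x93 and finds nothing, while the same call with utf8_mode set to 1 finds the text. The "HAL{2}" case works in both modes because its unit is a single byte.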

Regards,
Plamen Totev
