Re: Git grep does not support multi-byte characters (like UTF-8)

Plamen Totev <plamen.totev@xxxxxx> · Tue, 7 Jul 2015 11:58:54 +0300 (EEST)

Nguyen, thanks for the help and the patch. The escaping suggested by Scharfe also seems like a good choice. But I dug some more into the problem and found a few other things, which is why I am replying on the main thread rather than on the patch. I hope you'll excuse me if this is bad practice.

git grep -i -P also does not work, because the PCRE_UTF8 flag is not set and so the pcre library does not treat the string as UTF-8.

Pickaxe search also uses kwsearch, so case-insensitive search does not work with it either (e.g. git log -i -S). Maybe this is less of a problem here, as one is expected to search for an exact string (and hence knows the case).

There is an interesting corner case. is_fixed treats all patterns containing NULs as fixed. So what happens if the string contains non-ASCII symbols as well as NULs and the search is case insensitive? :) I have to admit that my knowledge of UTF-8 is not enough to say whether this could occur during normal usage, for example if the second byte of a multi-byte symbol were NUL. I would guess it cannot, as that would break a lot of programs that depend on NUL-delimited strings, but it would be good if somebody could confirm.
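
The guess about NUL can be checked mechanically. The sketch below is Python rather than git's C, purely for brevity, and multibyte_codepoints_with_low_bytes is a hypothetical helper name. It encodes every Unicode scalar value above U+007F and collects any whose UTF-8 form contains a byte below 0x80, a range that covers NUL as well as ASCII regex metacharacters such as '{'.

```python
def multibyte_codepoints_with_low_bytes():
    """Return code points above U+007F whose UTF-8 encoding contains
    a byte in the ASCII range 0x00-0x7F (NUL included)."""
    offenders = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values; UTF-8 cannot encode them
        if min(chr(cp).encode("utf-8")) < 0x80:
            offenders.append(cp)
    return offenders
```

By the UTF-8 encoding rules every byte of a multi-byte sequence has its high bit set, so the function should return an empty list. In valid UTF-8 a NUL byte can therefore only be U+0000 itself, never part of a longer character.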

GNU grep indeed uses escaped regular expressions when the string is in a multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too, especially since git already has support and helper functions for working with UTF-8. As for the other multi-byte encodings, I think things would become more complicated. As far as I know, in UTF-16 the '{' character, for example, is two bytes, not one. Maybe support really could be added only for UTF-8, with a warning issued if the string is not UTF-8.

So maybe the following makes sense when a grep search is performed:
* check whether a multi-byte encoding is used. If it is, the search is case insensitive and the encoding is not UTF-8, give a warning;
* if pcre is used and the string is UTF-8 encoded, set the PCRE_UTF8 flag;
* if the search is case insensitive, the string is fixed and the encoding used is UTF-8, use regcomp instead of kwsearch and escape any regex special characters in the pattern.
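
The last step in the list can be illustrated with a small sketch. It is written in Python, so re.escape and re.IGNORECASE only stand in for the escaping step and for regcomp with REG_ICASE. It shows the idea and is not git code. compile_fixed_icase and grep_lines are hypothetical names.

```python
import re


def compile_fixed_icase(pattern):
    """Compile a fixed string as a case-insensitive regex.

    Escaping first makes characters such as '.' match only themselves,
    so the regex engine behaves like a fixed-string search that also
    knows how to fold the case of non-ASCII letters.
    """
    return re.compile(re.escape(pattern), re.IGNORECASE)


def grep_lines(pattern, lines):
    """Return the lines that contain the fixed string, ignoring case."""
    rx = compile_fixed_icase(pattern)
    return [line for line in lines if rx.search(line)]
```

For example, grep_lines("Ä.b", ["xä.Bx", "xäzbx"]) keeps only the first line: the escaped "." no longer matches "z", while "Ä" still matches "ä".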

The question of how pickaxe search should behave remains open. Using kwset does not work for case-insensitive non-ASCII searches. Instead of fixing grep.c, maybe it is better to introduce a new function that performs keyword searches, so it could be used by grep, diffcore-pickaxe and any other code that may require such functionality in the future. Or maybe diffcore-pickaxe should use grep instead of using kwset/regcomp directly.
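
To illustrate why a byte-oriented keyword search fails here, the sketch below assumes that the case-insensitive mode folds case one byte at a time through a 256-entry translation table. That is an assumption about kwset, not something verified in this mail. The code is Python, where bytes.lower() folds ASCII bytes only, and both function names are hypothetical.

```python
def bytewise_icase_find(haystack, needle):
    """Case-insensitive substring test that folds ASCII bytes only,
    as a per-byte translation table would."""
    return needle.encode("utf-8").lower() in haystack.encode("utf-8").lower()


def unicode_icase_find(haystack, needle):
    """Case-insensitive substring test that folds whole characters."""
    return needle.lower() in haystack.lower()
```

With these, bytewise_icase_find("ЗДРАВЕЙ", "здравей") is False because none of the UTF-8 bytes are ASCII letters, while unicode_icase_find with the same arguments is True. ASCII input such as ("Hello World", "WORLD") matches either way.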

Regards,
Plamen Totev

>-------- Original message --------
>From: Duy Nguyen pclouds@xxxxxxxxx
>Subject: Re: Git grep does not support multi-byte characters (like UTF-8)
>To: Plamen Totev <plamen.totev@xxxxxx>
>Sent: 06.07.2015 15:23

> I think we over-optimized a bit. If your system provides regex
> with locale support (e.g. Linux) and you don't explicitly use the fallback
> regex implementation, it should work. I suppose your test patterns
> look "fixed" (i.e. no regex special characters)? Can you try just adding
> "." and see if case insensitive matching works?

This is indeed the problem. When I added the ".", the matching worked just fine.