Re: Git grep does not support multi-byte characters (like UTF-8)

Duy Nguyen <pclouds@xxxxxxxxx> · Tue, 7 Jul 2015 19:22:28 +0700

On Tue, Jul 7, 2015 at 3:58 PM, Plamen Totev <plamen.totev@xxxxxx> wrote:
> Nguyen, thanks for the help and the patch. Also, the escaping suggested by Scharfe seems like a good choice. But I dug some more into the problem and found some other things. That's why I replied on the main thread and not on the patch. I hope you'll excuse me if this is bad practice.

So far this is very good reporting. I can't complain :)

> git grep -i -P also does not work, because PCRE_UTF8 is not set and the pcre library does not treat the string as UTF-8.

We do prefer UTF-8, but I don't know if we can assume UTF-8 everywhere
yet. I guess it's OK in this case.

> pickaxe search also uses kwsearch, so case-insensitive search with it does not work (e.g. git log -i -S). Maybe this is less of a problem here, as one is expected to search for an exact string (and hence knows the case).

Will fix (I'm close to being done with git-grep, not counting the PCRE
bug you just found).

> There is an interesting corner case. is_fixed treats all patterns containing NULs as fixed. So what happens if the string contains non-ASCII symbols as well as NULs and the search is case insensitive? :) I have to admit that my knowledge of UTF-8 is not enough to answer the question of whether this could occur during normal usage, for example if the second byte of a multi-byte symbol is NUL. I would guess that's not the case, as it would break a lot of programs that depend on NUL-delimited strings, but it would be good if somebody could confirm.

For UTF-8, if NUL occurs in a byte stream, it must be ASCII NUL, not
part of any multibyte character: every byte of a multibyte sequence has
its high bit set. UTF-8 is really well tuned for C strings.
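
If you want to convince yourself, here is a small brute-force check in
Python (just an illustration, not git code; the function name is made
up). It encodes every Unicode scalar value above U+007F and verifies
that none of the resulting bytes falls in the ASCII range, which
includes NUL:

```python
def multibyte_utf8_avoids_ascii():
    """Check that no byte of a multibyte UTF-8 sequence is below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            return False
    return True


# Only U+0000 itself produces a zero byte, and ASCII stays one byte wide.
assert "\0".encode("utf-8") == b"\x00"
assert "{".encode("utf-8") == b"{"
assert multibyte_utf8_avoids_ascii()
```

So a zero byte in valid UTF-8 input can only be a real NUL character.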

> GNU grep indeed uses escaped regular expressions when the string uses a multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too, especially since git already has support and helper functions for working with UTF-8. As for the other multi-byte encodings, I think things would become more complicated. As far as I know in UTF-8 the '{' character for example is two bytes not one. Maybe support could really be added only for UTF-8, and if the string is not UTF-8, a warning could be issued.

In the worst case we could reuse the trick we do elsewhere in git:
convert to UTF-8 with iconv, do whatever we need to (escaping...), then
convert back before feeding it to regcomp and friends.
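
A rough Python sketch of that round trip, to show the shape of it (this
is not git code; escape_via_utf8 is a made-up name, Python's codecs
stand in for iconv, and re.escape stands in for whatever escaping we
end up writing; it assumes Python 3.7 or later, where re.escape leaves
non-ASCII bytes alone). Note that in UTF-8 '{' is still the single byte
0x7B. It is in encodings such as UTF-16, or Shift_JIS where trail bytes
can fall in the ASCII range, that byte-wise escaping gets dangerous,
which is why the escaping happens on the UTF-8 form:

```python
import re


def escape_via_utf8(pattern, encoding):
    """Escape regex metacharacters in a byte pattern given in `encoding`."""
    # Step 1: convert the pattern to UTF-8 (the job iconv would do in C).
    utf8 = pattern.decode(encoding).encode("utf-8")
    # Step 2: escape byte-wise. This is safe in UTF-8 because ASCII
    # metacharacters never occur inside a multibyte sequence.
    escaped = re.escape(utf8)
    # Step 3: convert back to the original encoding for the regex engine.
    return escaped.decode("utf-8").encode(encoding)


# A Latin-1 pattern with one non-ASCII letter and two metacharacters.
print(escape_via_utf8("é.{".encode("latin-1"), "latin-1"))
```

The same call works for a multibyte legacy encoding, e.g.
escape_via_utf8(pattern, "shift_jis"), as long as the pattern is valid
in that encoding.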

> So maybe the following makes sense when a grep search is performed:
> * check whether a multi-byte encoding is used. If it is, the search is case insensitive and the encoding is not UTF-8, give a warning;
> * if pcre is used and the string is UTF-8 encoded, set the PCRE_UTF8 flag;
> * if the search is case insensitive, the string is fixed and the encoding used is UTF-8, use regcomp instead of kwsearch and escape any regex special characters in the pattern.
>
> And the question of the behavior of pickaxe search remains open. Using kwset does not work with case-insensitive non-ASCII searches. Instead of fixing grep.c, maybe it's better to introduce a new function that performs keyword searches, so it could be used by grep, diffcore-pickaxe and any other code that may require such functionality in the future. Or maybe diffcore-pickaxe should use grep instead of using kwset/regcomp directly.

That function would be called "grep". More refactoring would be needed.
"git grep regcomp" reveals so many places. Some of them would
benefit from kwset if we provide the new function you mentioned.
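
To make the idea concrete, here is a rough Python sketch of what such a
shared helper could decide (purely illustrative; make_literal_matcher
is an invented name, and the real thing would be C on top of kwset and
regcomp). It mirrors the last item of your list: plain substring search
where that is safe, and an escaped case-insensitive regex when the
needle is non-ASCII:

```python
import re


def make_literal_matcher(needle, ignore_case=False):
    """Return a function(text) -> bool that searches for a literal string."""
    if ignore_case and not needle.isascii():
        # A byte-oriented keyword search cannot fold non-ASCII case, so
        # fall back to a regex with all metacharacters escaped.
        rx = re.compile(re.escape(needle), re.IGNORECASE)
        return lambda text: rx.search(text) is not None
    if ignore_case:
        # ASCII-only needle: simple case folding is enough here.
        lowered = needle.lower()
        return lambda text: lowered in text.lower()
    # Case-sensitive search: a plain substring test, no regex involved.
    return lambda text: needle in text


match = make_literal_matcher("ÄB.", ignore_case=True)
print(match("xx äb. yy"), match("xx äbc yy"))
```

Callers such as grep and pickaxe would then only ask for a matcher and
never need to know which engine is behind it.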

> Regards,
> Plamen Totev
>
>>-------- Original message --------
>>From: Duy Nguyen pclouds@xxxxxxxxx
>>Subject: Re: Git grep does not support multi-byte characters (like UTF-8)
>>To: Plamen Totev <plamen.totev@xxxxxx>
>>Sent: 06.07.2015 15:23
>
>> I think we over-optimized a bit. If your system provides regex
>> with locale support (e.g. Linux) and you don't explicitly use the
>> fallback regex implementation, it should work. I suppose your test
>> patterns look "fixed" (i.e. no regex special characters)? Can you try
>> just adding "." and see if case insensitive matching works?
>
> This is indeed the problem. When I added the "." the matching worked just fine.

-- 
Duy