Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

On Wed, Nov 17 2021, René Scharfe wrote:

> Am 16.11.21 um 10:38 schrieb Carlo Arenas:
>> On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@xxxxxxxxxxxxxx> wrote:
>>>
>>> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
>>>         git grep -h "var" invalid-0x80 >actual &&
>>>         test_cmp expected actual &&
>>>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>>>         test_cmp expected actual
>>>
>>> ++ git grep -h var invalid-0x80
>>> ++ test_cmp expected actual
>>> ++ test 2 -ne 2
>>> ++ eval 'diff -u' '"$@"'
>>> +++ diff -u expected actual
>>> ++ git grep -h '(*NO_JIT)var' invalid-0x80
>>> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
>>
>> That is exactly what I was worried about, this is not failing one
>> test, but making `git grep` unusable in any repository that has any
>> binary files that might be reachable by it, and it is likely affecting
>> anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprintf() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

PCRE2_UTF also matters for literal patterns. Try peeling apart the two
UTF-8 bytes of "é" and matching them under -i with and without PCRE2_UTF.

I don't know what the right behavior should be here (haven't had time to
dig), but it matters for matching.
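
To make the point concrete, here is a rough sketch of the same effect. It
uses Python's `re` module rather than PCRE2, so it is only an analogy: I am
assuming that a bytes pattern stands in for matching without PCRE2_UTF and a
str pattern for matching with it. None of the names below come from git.

```python
import re

# "é" (U+00E9) is the two bytes C3 A9 in UTF-8; "É" (U+00C9) is C3 89.
lower = "\u00e9".encode("utf-8")   # b'\xc3\xa9'
upper = "\u00c9".encode("utf-8")   # b'\xc3\x89'

# Byte mode (stand-in for "no PCRE2_UTF"): case folding only covers ASCII,
# so the literal bytes of "é" do not match the bytes of "É" under -i.
byte_ci = re.search(re.escape(lower), upper, re.IGNORECASE) is not None

# Character mode (stand-in for "PCRE2_UTF"): "é" and "É" fold together.
char_ci = re.search("\u00e9", "\u00c9", re.IGNORECASE) is not None

# Byte mode can also match half a character: the lone continuation
# byte A9 is found inside the encoding of "é".
half = re.search(b"\xa9", lower) is not None

print(byte_ci, char_ci, half)
```

So even for a literal pattern, whether the matcher works on bytes or on
characters changes both what -i folds and whether a match can land in the
middle of a multi-byte sequence.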



