Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

René Scharfe <l.s.r@xxxxxx> writes:

>>> Literal patterns are those that don't use any wildcards or case-folding.
>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>> in UTF-8 and is not just a literal pattern.
>>>
>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>> ASCII chars?
>>> ...
>>     $ echo 'René Scharfe' >f
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>     1
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>     f:René Scharfe
>>     0
>>
>> So it's a choose-your-own adventure where you can pick if you're
>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>
> Yes, I do, and it's what Hamza's patch is fixing.

That may be correct, but is this discussion still about "Why enable
... for literal patterns that consist of only ASCII"?  Calling "." a
"metacharacter" and wanting it to match anything other than a single
dot would mean the pattern we are discussing is no longer "literal",
wouldn't it?  I am puzzled.

>> These sorts of patterns demonstrate nicely that the relationship between your
>> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
>> isn't what you might imagine it to be.
>
> Differences are:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: literal
>
> o h i l master hamza rene
> 0 0 0 0      0     1    0
> 0 0 0 1      0     1    0
> 0 0 1 0      0     1    1   <== your first example
> 0 0 1 1      0     1    0
> 0 1 1 1      0     0    1
>
> Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
> lines).
>
> Turning on PCRE2_UTF for literal matching (fourth line) goes against
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26).
>
> Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
> line) also goes against that, so my intuition betrayed me.  When I
> adjust it, I get:
>
> 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> That looks deceptively simple -- just drop has_non_ascii(p->pattern)
> from the original condition.
>
> Your second example is handled the same way by all versions, btw:
>
> o h i l master hamza rene
> 0 1 1 0      1     1    1
>
> René
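
The table quoted above can be checked mechanically.  What follows is
only a sketch, not code from grep.c: the utf_master(), utf_hamza() and
utf_rene() predicates are hypothetical helpers whose bodies are inferred
from the six quoted rows, and they are not guaranteed to agree with the
real implementations on input combinations the table does not list.
Only utf_rene_adjusted() corresponds to a condition that is actually
spelled out in the quoted message.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Inputs mirror the table columns:
 *   o: opt->ignore_locale    h: has_non_ascii(p->pattern)
 *   i: is_utf8_locale()      l: literal
 * Each helper returns 1 if that variant would turn on PCRE2_UTF.
 */

/* ASSUMPTION: inferred from the "master" column, not copied from grep.c */
static int utf_master(int o, int h, int i, int l)
{
	return !o && h && i && !l;
}

/* ASSUMPTION: inferred from the "hamza" column of the quoted rows */
static int utf_hamza(int o, int h, int i, int l)
{
	return !o && (!h || (i && !l));
}

/* ASSUMPTION: inferred from the "rene" column (before the adjustment) */
static int utf_rene(int o, int h, int i, int l)
{
	return !o && i && (h || !l);
}

/* The adjusted condition quoted above; h no longer matters */
static int utf_rene_adjusted(int o, int h, int i, int l)
{
	(void)h;
	return !o && i && !l;
}

struct row { int o, h, i, l, master, hamza, rene; };

/* The six rows exactly as they appear in the quoted message */
static const struct row quoted_rows[] = {
	{ 0, 0, 0, 0,  0, 1, 0 },
	{ 0, 0, 0, 1,  0, 1, 0 },
	{ 0, 0, 1, 0,  0, 1, 1 },
	{ 0, 0, 1, 1,  0, 1, 0 },
	{ 0, 1, 1, 1,  0, 0, 1 },
	{ 0, 1, 1, 0,  1, 1, 1 },
};

/* Assert that the helpers reproduce the quoted rows, then print them */
void check_and_print_rows(void)
{
	size_t n = sizeof(quoted_rows) / sizeof(quoted_rows[0]);

	printf("o h i l master hamza rene adjusted\n");
	for (size_t k = 0; k < n; k++) {
		const struct row *r = &quoted_rows[k];

		assert(utf_master(r->o, r->h, r->i, r->l) == r->master);
		assert(utf_hamza(r->o, r->h, r->i, r->l) == r->hamza);
		assert(utf_rene(r->o, r->h, r->i, r->l) == r->rene);
		printf("%d %d %d %d %6d %5d %4d %8d\n",
		       r->o, r->h, r->i, r->l, r->master, r->hamza, r->rene,
		       utf_rene_adjusted(r->o, r->h, r->i, r->l));
	}
}
```

Calling check_and_print_rows() from a main() prints the quoted rows with
one extra column for the adjusted condition.  That column is 1 only when
i is set and l is clear, so the adjusted condition never enables
PCRE2_UTF for a literal pattern, in line with 870eea8166.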



