Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

On 17.10.21 at 08:00, Junio C Hamano wrote:
> René Scharfe <l.s.r@xxxxxx> writes:
>
>>>> Literal patterns are those that don't use any wildcards or case-folding.
>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>>> in UTF-8 and is not just a literal pattern.
>>>>
>>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>>> ASCII chars?
>>>> ...
>>>     echo 'René Scharfe' >f &&
>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>>     1
>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>>     f:René Scharfe
>>>     0
>>>
>>> So it's a choose-your-own adventure where you can pick if you're
>>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>>
>> Yes, I do, and it's what Hamza's patch is fixing.
>
> That may be correct but is this discussion still about "Why enable
> ... for literal patterns that consist of only ASCII"?  Calling "." a
> "metacharacter" and wanting it to match anything other than a single
> dot would mean the pattern we are discussing is no longer "literal",
> isn't it?  I am puzzled.

Right, Ævar's comment is not about my question, but highlights an
inconsistency in master that is fixed by Hamza's patch.

I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for
non-ASCII patterns even if they are literal or our locale is not UTF-8.
The following change would fix the edge case mentioned in its commit
message without these side-effects.  Am I correct?

diff --git a/grep.c b/grep.c
index fe847a0111..5badb6d851 100644
--- a/grep.c
+++ b/grep.c
@@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
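
To spell out what that one-line change does, here is a minimal sketch
that models the condition before and after as plain boolean functions.
The struct and the function names are made up for illustration and do
not exist in grep.c; only the boolean expressions are taken from the
diff above, and the struct fields merely stand in for the values that
the real code reads from opt and p.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inputs the condition in grep.c reads. */
struct utf_inputs {
	bool ignore_locale; /* opt->ignore_locale */
	bool utf8_locale;   /* result of is_utf8_locale() */
	bool non_ascii;     /* result of has_non_ascii(p->pattern) */
	bool ignore_case;   /* opt->ignore_case */
	bool literal;       /* p->fixed || p->is_fixed */
};

/* Condition on the removed ("-") line of the diff. */
static bool utf_before(const struct utf_inputs *in)
{
	return !in->ignore_locale && in->utf8_locale && in->non_ascii &&
	       !(!in->ignore_case && in->literal);
}

/* Condition on the added ("+") line: the has_non_ascii() check is gone. */
static bool utf_after(const struct utf_inputs *in)
{
	return !in->ignore_locale && in->utf8_locale &&
	       !(!in->ignore_case && in->literal);
}
```

In this model, an ASCII-only pattern that is not literal (such as the
'Ren. S' example quoted above) in a UTF-8 locale yields false from
utf_before() and true from utf_after().  A case-sensitive literal
pattern and a non-UTF-8 locale yield false from both.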




