René Scharfe <l.s.r@xxxxxx> writes:

>>> Literal patterns are those that don't use any wildcards or case-folding.
>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>> in UTF-8 and is not just a literal pattern.
>>>
>>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only
>>> ASCII chars?
>>> ...
>>	echo 'René Scharfe' >f &&
>>	$ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>	1
>>	$ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>	f:René Scharfe
>>	0
>>
>> So it's a choose-your-own adventure where you can pick if you're
>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>
> Yes, I do, and it's what Hamza's patch is fixing.

That may be correct but is this discussion still about "Why enable
... for literal patterns that consist of only ASCII"? Calling "."
a "metacharacter" and wanting it to match anything other than a
single dot would mean the pattern we are discussing is no longer
"literal", isn't it?

I am puzzled.

>> These sorts of patterns demonstrate nicely that the relationship between your
>> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
>> isn't what you might imagine it to be.
>
> Differences are:
>
>    o: opt->ignore_locale
>    h: has_non_ascii(p->pattern)
>    i: is_utf8_locale()
>    l: literal
>
>    o h i l master hamza rene
>    0 0 0 0      0     1    0
>    0 0 0 1      0     1    0
>    0 0 1 0      0     1    1 <== your first example
>    0 0 1 1      0     1    0
>    0 1 1 1      0     0    1
>
> Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
> lines).
>
> Turning on PCRE2_UTF for literal matching (fourth line) goes against
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26).
>
> Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
> line) also goes against that, so my intuition betrayed me. When I
> adjust it, I get:
>
> 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> That looks deceptively simple -- just drop has_non_ascii(p->pattern)
> from the original condition.
>
> Your second example is handle the same by all versions btw.:
>
>    o h i l master hamza rene
>    0 1 1 0      1     1    1
>
> René