Am 16.10.21 um 19:12 schrieb Ævar Arnfjörð Bjarmason: > > On Sat, Oct 16 2021, René Scharfe wrote: > >> Am 15.10.21 um 22:03 schrieb Junio C Hamano: >>> Hamza Mahfooz <someguy@xxxxxxxxxxxxxxxxxxx> writes: >>> >>>> If we attempt to grep non-ascii log message text with an ascii pattern, we >>> >>> "with an ascii pattern, when Git is built with and told to use pcre2, we" >>> >>>> run into the following issue: >>>> >>>> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author >>>> grep: (standard input): binary file matches >> >> I get no error message on macOS 11.6, but this result, with the underlined >> part in red: >> >> Author: ??var Arnfjörð Bjarmason <avarab@xxxxxxxxx> >> ^^^^^^^^^^^^^^^^^^ >> >> So the pattern matches the second byte of a two-byte character, inserts a >> color code in the middle and thus produces invalid output in this case. > > Thanks for digging into these edge cases... > >>>> >>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log >>>> output is encoded in UTF-8. >>> >>>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>>> + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || >>>> + (!opt->ignore_locale && is_utf8_locale() && >>>> + has_non_ascii(p->pattern) && !(!opt->ignore_case && >>>> + (p->fixed || p->is_fixed)))) >>> >>> That's a mouthful. It is not obvious what new condition is being >>> added. I had to flip the order to see the only difference is, that >>> >>>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>>> + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || >>>> + (!opt->ignore_locale && !has_non_ascii(p->pattern))) >>> >>> ... in addition to the case where the original condition holds, if >>> we do not say "ignore locale" and the pattern is ascii-only, we >>> apply these two option flags. And that matches what the proposed >>> log message explained as the condition the problem appears. >>> >>> So,... looks good, I guess. >>> >>> Thanks, will queue. >>> >>> >>> Addendum. >>> >>> If we were reordering pieces in the condition, I wonder if there is >>> a better way to reorganize it, though. The original is already >>> barely explainable with words, and with this new condition added, I >>> am not sure if anybody can phrase the condition in simple words to >>> others after staring it for a few minutes. I can't. >>> >>> But straightening it out is best left as a future clean-up patch, >>> separate from this series. >>> >> >> It can be written as: >> >> literal = !opt->ignore_case && (p->fixed || p->is_fixed); >> if (!opt->ignore_locale) { >> if (!has_non_ascii(p->pattern) || >> (is_utf8_locale() && !literal)) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } > > Whatever we go from here I'm very much for untangling that condition, > but I guess it can be done as a follow-up too, I'll defer to Hamza > there... > >> Literal patterns are those that don't use any wildcards or case-folding. >> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >> pattern only consists of ASCII characters, or if the pattern is encoded >> in UTF-8 and is not just a literal pattern. >> >> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >> ASCII chars? >> >> The old condition was (reformatted to better match the new one): >> >> if (!opt->ignore_locale) { >> if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } >> >> Intuitively I'd say the condition should be: >> >> if (!opt->ignore_locale && is_utf8_locale()) { >> if (has_non_ascii(p->pattern) || !literal) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } >> >> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we >> have to match non-ASCII characters or do more than just literal >> matching. >> >> For literal patterns that consist only of ASCII characters we don't need >> the cost and complication of PCRE2_UTF. >> >> Makes sense? > > echo 'René Scharfe' >f && > $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? > 1 > $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? > f:René Scharfe > 0 > > So it's a choose-your-own adventure where you can pick if you're > right. I.e. do you want the "." metacharacter to match your "é" or not? Yes, I do, and it's what Hamza's patch is fixing. > These sorts of patterns demonstrate nicely that the relationship between your > pattern being ASCII and wanting or not wanting UTF-8 matching semantics > isn't what you might imagine it to be. Differences are: o: opt->ignore_locale h: has_non_ascii(p->pattern) i: is_utf8_locale() l: literal o h i l master hamza rene 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 <== your first example 0 0 1 1 0 1 0 0 1 1 1 0 0 1 Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two lines). Turning on PCRE2_UTF for literal matching (fourth line) goes against 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26). Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth line) also goes against that, so my intuition betrayed me. When I adjust it, I get: if (!opt->ignore_locale && is_utf8_locale() && !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); That looks deceptively simple -- just drop has_non_ascii(p->pattern) from the original condition. Your second example is handle the same by all versions btw.: o h i l master hamza rene 0 1 1 0 1 1 1 René