Am 15.10.21 um 22:03 schrieb Junio C Hamano: > Hamza Mahfooz <someguy@xxxxxxxxxxxxxxxxxxx> writes: > >> If we attempt to grep non-ascii log message text with an ascii pattern, we > > "with an ascii pattern, when Git is built with and told to use pcre2, we" > >> run into the following issue: >> >> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author >> grep: (standard input): binary file matches I get no error message on macOS 11.6, but this result, with the underlined part in red: Author: ??var Arnfjörð Bjarmason <avarab@xxxxxxxxx> ^^^^^^^^^^^^^^^^^^ So the pattern matches the second byte of a two-byte character, inserts a color code in the middle and thus produces invalid output in this case. >> >> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log >> output is encoded in UTF-8. > >> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >> + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || >> + (!opt->ignore_locale && is_utf8_locale() && >> + has_non_ascii(p->pattern) && !(!opt->ignore_case && >> + (p->fixed || p->is_fixed)))) > > That's a mouthful. It is not obvious what new condition is being > added. I had to flip the order to see the only difference is, that > >> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >> + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || >> + (!opt->ignore_locale && !has_non_ascii(p->pattern))) > > ... in addition to the case where the original condition holds, if > we do not say "ignore locale" and the pattern is ascii-only, we > apply these two option flags. And that matches what the proposed > log message explained as the condition the problem appears. > > So,... looks good, I guess. > > Thanks, will queue. > > > Addendum. > > If we were reordering pieces in the condition, I wonder if there is > a better way to reorganize it, though. The original is already > barely explainable with words, and with this new condition added, I > am not sure if anybody can phrase the condition in simple words to > others after staring it for a few minutes. I can't. > > But straightening it out is best left as a future clean-up patch, > separate from this series. > It can be written as: literal = !opt->ignore_case && (p->fixed || p->is_fixed); if (!opt->ignore_locale) { if (!has_non_ascii(p->pattern) || (is_utf8_locale() && !literal)) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } Literal patterns are those that don't use any wildcards or case-folding. If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the pattern only consists of ASCII characters, or if the pattern is encoded in UTF-8 and is not just a literal pattern. Hmm. Why enable PCRE2_UTF for literal patterns that consist of only ASCII chars? The old condition was (reformatted to better match the new one): if (!opt->ignore_locale) { if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } Intuitively I'd say the condition should be: if (!opt->ignore_locale && is_utf8_locale()) { if (has_non_ascii(p->pattern) || !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we have to match non-ASCII characters or do more than just literal matching. For literal patterns that consist only of ASCII characters we don't need the cost and complication of PCRE2_UTF. Makes sense? René