On Sun, Oct 17 2021, René Scharfe wrote: > Am 17.10.21 um 08:00 schrieb Junio C Hamano: >> René Scharfe <l.s.r@xxxxxx> writes: >> >>>>> Literal patterns are those that don't use any wildcards or case-folding. >>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >>>>> pattern only consists of ASCII characters, or if the pattern is encoded >>>>> in UTF-8 and is not just a literal pattern. >>>>> >>>>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >>>>> ASCII chars? >>>>> ... >>>> echo 'René Scharfe' >f && >>>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? >>>> 1 >>>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? >>>> f:René Scharfe >>>> 0 >>>> >>>> So it's a choose-your-own adventure where you can pick if you're >>>> right. I.e. do you want the "." metacharacter to match your "é" or not? >>> >>> Yes, I do, and it's what Hamza's patch is fixing. >> >> That may be correct but is this discussion still about "Why enable >> ... for literal patterns that consist of only ASCII"? Calling "." a >> "metacharacter" and wanting it to match anything other than a single >> dot would mean the pattern we are discussing is no longer "literal", >> isn't it? I am puzzled. > > Right, Ævar's comment is not about my question, but highlights an > inconsistency in master that is fixed by Hamza's patch. Yes, sorry about that. Just generally about the messy semantics. > I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for > non-ASCII patterns even if they are literal or our locale is not UTF-8. > The following change would fix the edge case mentioned in its commit > message without these side-effects. Am I correct? > > diff --git a/grep.c b/grep.c > index fe847a0111..5badb6d851 100644 > --- a/grep.c > +++ b/grep.c > @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > + if (!opt->ignore_locale && is_utf8_locale() && > !(!opt->ignore_case && (p->fixed || p->is_fixed))) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); I haven't had time to carefully look into this, but one caveat to check out is if it works with older PCRE and whether you need e.g. the GIT_PCRE2_VERSION_10_36_OR_HIGHER. I tried your suggestion on top of Hamza's series, compiled PCRE v2 10.23, tested, and also tried manually removing the PCRE2_MATCH_INVALID_UTF flag and tested again. We pass all tests with both, so maybe this is safe to do (or maybe we're missing some test we haven't thought of yet...). One thing that makes me nervous is that we had breakages in the past once the patches escaped into the wild, particularly because the code being modified here has is_utf8_locale(), and our tests run under LANG=c LC_ALL=C. I tried running all the tests with a non-C locale, there's a lot of failures, but none new with this change. As an aside the below patch makes all but one shortlog test pass for me. I wonder if we shouldn't do this for real to smoke out any $LANG or $LC_ALL dependencies. I.e. almost all of the failures were due to relying on the sort order of sort(1), and in one case comm(1), the first hunk here is also redundant to defining our own ls(1) wrapper.... diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index bd17f308b38..738ca6ef587 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -693,12 +693,12 @@ test_expect_success 'expire removes unreferenced packs' ' ^refs/heads/C EOF git multi-pack-index write && - ls .git/objects/pack | grep -v -e pack-[AB] >expect && + ls .git/objects/pack | sort | grep -v -e pack-[AB] >expect && git multi-pack-index expire && - ls .git/objects/pack >actual && + ls .git/objects/pack | sort >actual && test_cmp expect actual && - ls .git/objects/pack/ | grep idx >expect-idx && - test-tool read-midx .git/objects | grep idx >actual-midx && + ls .git/objects/pack/ | sort | grep idx >expect-idx && + test-tool read-midx .git/objects | sort | grep idx >actual-midx && test_cmp expect-idx actual-midx && git multi-pack-index verify && git fsck @@ -802,7 +802,7 @@ test_expect_success 'expire works when adding new packs' ' refs/heads/E EOF git multi-pack-index expire && - ls .git/objects/pack/ | grep idx >expect && + ls .git/objects/pack/ | sort | grep idx >expect && test-tool read-midx .git/objects | grep idx >actual && test_cmp expect actual && git multi-pack-index verify diff --git a/t/test-lib.sh b/t/test-lib.sh index 8361b5c1c57..f4f9d231f28 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -417,14 +417,22 @@ test -n "$BASH_VERSION" && shopt -u checkwinsize 2>/dev/null # For repeatability, reset the environment to known value. # TERM is sanitized below, after saving color control sequences. -LANG=C -LC_ALL=C +LANG=en_US.UTF-8 +LC_ALL=en_US.UTF-8 PAGER=cat TZ=UTC COLUMNS=80 export LANG LC_ALL PAGER TZ COLUMNS EDITOR=: +sort () { + LC_ALL=C LANG=C command sort "$@" +} + +comm () { + LC_ALL=C LANG=C command comm "$@" +} + # A call to "unset" with no arguments causes at least Solaris 10 # /usr/xpg4/bin/sh and /bin/ksh to bail out. So keep the unsets # deriving from the command substitution clustered with the other