Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Oct 17 2021, René Scharfe wrote:

> Am 17.10.21 um 08:00 schrieb Junio C Hamano:
>> René Scharfe <l.s.r@xxxxxx> writes:
>>
>>>>> Literal patterns are those that don't use any wildcards or case-folding.
>>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>>>> in UTF-8 and is not just a literal pattern.
>>>>>
>>>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>>>> ASCII chars?
>>>>> ...
>>>>     echo 'René Scharfe' >f &&
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>>>     1
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>>>     f:René Scharfe
>>>>     0
>>>>
>>>> So it's a choose-your-own adventure where you can pick if you're
>>>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>>>
>>> Yes, I do, and it's what Hamza's patch is fixing.
>>
>> That may be correct but is this discussion still about "Why enable
>> ... for literal patterns that consist of only ASCII"?  Calling "." a
>> "metacharacter" and wanting it to match anything other than a single
>> dot would mean the pattern we are discussing is no longer "literal",
>> isn't it?  I am puzzled.
>
> Right, Ævar's comment is not about my question, but highlights an
> inconsistency in master that is fixed by Hamza's patch.

Yes, sorry about that. Just generally about the messy semantics.

> I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for
> non-ASCII patterns even if they are literal or our locale is not UTF-8.
> The following change would fix the edge case mentioned in its commit
> message without these side-effects.  Am I correct?
>
> diff --git a/grep.c b/grep.c
> index fe847a0111..5badb6d851 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	if (!opt->ignore_locale && is_utf8_locale() &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

I haven't had time to carefully look into this, but one caveat to check
out is if it works with older PCRE and whether you need e.g. the
GIT_PCRE2_VERSION_10_36_OR_HIGHER.

I tried your suggestion on top of Hamza's series, compiled PCRE v2
10.23, tested, and also tried manually removing the
PCRE2_MATCH_INVALID_UTF flag and tested again.

We pass all tests with both, so maybe this is safe to do (or maybe we're
missing some test we haven't thought of yet...).

One thing that makes me nervous is that we had breakages in the past
once the patches escaped into the wild, particularly because the code
being modified here has is_utf8_locale(), and our tests run under
LANG=c LC_ALL=C.

I tried running all the tests with a non-C locale, there's a lot of
failures, but none new with this change. As an aside the below patch
makes all but one shortlog test pass for me. I wonder if we shouldn't do
this for real to smoke out any $LANG or $LC_ALL dependencies.

I.e. almost all of the failures were due to relying on the sort order of
sort(1), and in one case comm(1), the first hunk here is also redundant
to defining our own ls(1) wrapper....

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd17f308b38..738ca6ef587 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -693,12 +693,12 @@ test_expect_success 'expire removes unreferenced packs' '
 		^refs/heads/C
 		EOF
 		git multi-pack-index write &&
-		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		ls .git/objects/pack | sort | grep -v -e pack-[AB] >expect &&
 		git multi-pack-index expire &&
-		ls .git/objects/pack >actual &&
+		ls .git/objects/pack | sort >actual &&
 		test_cmp expect actual &&
-		ls .git/objects/pack/ | grep idx >expect-idx &&
-		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		ls .git/objects/pack/ | sort | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | sort | grep idx >actual-midx &&
 		test_cmp expect-idx actual-midx &&
 		git multi-pack-index verify &&
 		git fsck
@@ -802,7 +802,7 @@ test_expect_success 'expire works when adding new packs' '
 		refs/heads/E
 		EOF
 		git multi-pack-index expire &&
-		ls .git/objects/pack/ | grep idx >expect &&
+		ls .git/objects/pack/ | sort | grep idx >expect &&
 		test-tool read-midx .git/objects | grep idx >actual &&
 		test_cmp expect actual &&
 		git multi-pack-index verify
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 8361b5c1c57..f4f9d231f28 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -417,14 +417,22 @@ test -n "$BASH_VERSION" && shopt -u checkwinsize 2>/dev/null
 
 # For repeatability, reset the environment to known value.
 # TERM is sanitized below, after saving color control sequences.
-LANG=C
-LC_ALL=C
+LANG=en_US.UTF-8
+LC_ALL=en_US.UTF-8
 PAGER=cat
 TZ=UTC
 COLUMNS=80
 export LANG LC_ALL PAGER TZ COLUMNS
 EDITOR=:
 
+sort () {
+	LC_ALL=C LANG=C command sort "$@"
+}
+
+comm () {
+	LC_ALL=C LANG=C command comm "$@"
+}
+
 # A call to "unset" with no arguments causes at least Solaris 10
 # /usr/xpg4/bin/sh and /bin/ksh to bail out.  So keep the unsets
 # deriving from the command substitution clustered with the other




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux