Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> In short, it sounds to me that the earlier one that added PCRE2_UCP
>> unconditionally would be the best alternative among those that have
>> been discussed.
>
> I agree.

So, ... we'll mark cb/grep-pcre-ucp, ea8bc435 (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08), to be
merged to 'next' and see what happens.  I'll add your Acked-by
while at it, if you do not mind.

---- >8 --------- >8 --------- >8 --------- >8 ------
From: Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx>
Date: Sun, 8 Jan 2023 07:52:17 -0800
Subject: [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore result in failed
matches for expressions that rely on that extention, for ex:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% to 40% and that is shown through the added performance
test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
---
 grep.c                              |  2 +-
 t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p7822-grep-perl-character.sh

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
new file mode 100755
index 0000000000..87009c60df
--- /dev/null
+++ b/t/perf/p7822-grep-perl-character.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+	test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+	'\\bhow' \
+	'\\bÆvar' \
+	'\\d+ \\bÆvar' \
+	'\\bBelón\\b' \
+	'\\w{12}\\b'
+do
+	echo '$pattern' >pat
+	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+	then
+		test_perf "grep -P '$pattern'" --prereq PCRE "
+			git -P grep -f pat || :
+		"
+	else
+		for threads in $GIT_PERF_GREP_THREADS
+		do
+			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+				git -c grep.threads=$threads -P grep -f pat || :
+			"
+		done
+	fi
+done
+
+test_done
-- 
2.39.1-231-ga7caae2729




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux