[PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
by the pcre2_compile() call, but that would only allow for the
use of Unicode character properties when caseless is required
but not to include the additional UTF characters for all other
class matches.

This would result in failed matches for expressions that rely
on those properties, for ex:

  $ git grep -P '\bÆvar'

Add a configuration that could be used to enable the PCRE2_UCP
flag to correctly match those cases, when required.

The use of this has an impact on performance that has been estimated
to be significant.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx>
---
Changes since v2:
* make setting UCP and opt-in as suggested by Ævar
* remove performance test and instead add a test

 Documentation/config/grep.txt |  6 ++++++
 grep.c                        | 11 ++++++++++-
 grep.h                        |  1 +
 t/t7810-grep.sh               | 13 +++++++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index e521f20390..8848db7311 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -26,3 +26,9 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+pcre.ucp::
+	If set to true, will use all Unicode Character Properties when matching
+	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
+	as the underlying engine. If PCRE is not being used it is ignored.
+	Defaults to false
diff --git a/grep.c b/grep.c
index 06eed69493..ceafb8937d 100644
--- a/grep.c
+++ b/grep.c
@@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
 			return config_error_nonbool(var);
 		return color_parse(value, color);
 	}
+
+	if (!strcmp(var, "pcre.ucp")) {
+		opt->pcre_ucp = git_config_bool(var, value);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		if (opt->pcre_ucp)
+			options |= PCRE2_UCP;
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/grep.h b/grep.h
index 6075f997e6..082bd3a0c7 100644
--- a/grep.h
+++ b/grep.h
@@ -171,6 +171,7 @@ struct grep_opt {
 	int file_break;
 	int heading;
 	int max_count;
+	int pcre_ucp;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 8eded6ab27..a99a967060 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -95,6 +95,7 @@ test_expect_success setup '
 	then
 		echo "¿" >reverse-question-mark
 	fi &&
+	echo "émotion" >ucp &&
 	git add . &&
 	test_tick &&
 	git commit -m initial
@@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
 	test_cmp hello_world actual
 '
 
+test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
+	cat >expected <<-\EOF &&
+	ucp:émotion
+	EOF
+	cat >pattern <<-\EOF &&
+	\bémotion
+	EOF
+	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
+	test_cmp expected actual &&
+	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern
+'
+
 test_expect_success 'grep -G invalidpattern properly dies ' '
 	test_must_fail git grep -G "a["
 '
-- 
2.37.1 (Apple Git-137.1)




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux