Re: [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled

Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx> · Thu, 18 Nov 2021 11:40:21 -0800

On Thu, Nov 18, 2021 at 02:04:08AM -0800, Carlo Arenas wrote:
> 
> I haven't tested it, but I think that for this to work with the log,
> we also need to make sure that all log entries that might not be UTF-8
> get first iconv() which is why probably Æevar mentioned[1]
> i18n.commitEncoding in his old email.

Having tested it, it seems this might work, but needs at least the
following to make the test for it meaningful:

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 1da6b07a57..e1fbdc0f80 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -47,6 +47,8 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 '
 
 test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	GIT_COMMITTER_NAME="C O Mitter" &&
+	GIT_COMMITTER_EMAIL="committer@xxxxxxxxxxx" &&
 	git commit -m first &&
 	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
 	echo first >expected &&
@@ -72,10 +74,10 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern o
 	EOF
 	test_write_lines "fifth" >file5 &&
 	git add file5 &&
-	GIT_COMMITTER_NAME="Ç O Mîtter" &&
+	GIT_COMMITTER_NAME=$(echo "Ç O Mîtter" | iconv -t ISO-8859-1) &&
 	GIT_COMMITTER_EMAIL="committer@xxxxxxxxxxx" &&
 	git -c i18n.commitEncoding=latin1 commit -m thïrd &&
-	git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
+	git log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
 	grep Commit: log >actual.raw &&
 	test_decode_color <actual.raw >actual &&
 	test_cmp expected actual

The first hunk is a nice to have and mimics what is done in the previous
test where a non UTF match will match instead an equivalent ASCII text,
but is not strictly needed unless the expression is also tightened.

Note the second hunk is incomplete as the message is still not really
encoded as latin1 and will need an equivalent processing, but left it
out so it can be done together with fixing the corresponding test.

I have to admit that adding to a complex condition might be obscuring
some other edge case, and indeed the fact the test passed even without
this fix is concerning.

>From my reading of what Ævar suggested originally[1], it would seem that
the new is_log condition should be on an branch of its own, and more code
should be needed to make sure the contents we are going to use are indeed
utf8 by the time it hits pcre2_*match() before setting it.

Carlo

[1] https://lore.kernel.org/git/211116.86tugcp8mg.gmgdl@xxxxxxxxxxxxxxxxxxx/