Re: [PATCH v10 2/2] pretty: colorize pattern matches in commit messages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 29 2021, Hamza Mahfooz wrote:

> The "git log" command limits its output to the commits that contain strings
> matched by a pattern when the "--grep=<pattern>" option is used, but unlike
> output from "git grep -e <pattern>", the matches are not highlighted,
> making them harder to spot.
>
> Teach the pretty-printer code to highlight matches from the
> "--grep=<pattern>", "--author=<pattern>" and "--committer=<pattern>"
> options (to view the last one, you may have to ask for --pretty=fuller).
>
> Also, it must be noted that we are effectively greping the content twice,
> however it only slows down "git log --author=^H" on this repository by
> around 1-2% (compared to v2.33.0), so it should be a small enough slow
> down to justify the addition of the feature.

I tested the performance of this, that part looks good. Just for future
reference, is the running it twice just because it's a hassle to pass up
the match data or do the equivalent of a /g match on the first pass, and
doing this purely in the output emitter is easier?

There's a bug in how this highlighter handles UTF-8. I.e. with "grep" in
general we don't turn on UTF-8 unless the pattern itself is
UTF-8. Leading to this edge case:
    
    $ git grep Ævar -- CODE_OF_CONDUCT.md
    CODE_OF_CONDUCT.md:  - Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
    $ git grep .var -- CODE_OF_CONDUCT.md
    CODE_OF_CONDUCT.md:  - <C3><86>var Arnfjörð Bjarmason <avarab@xxxxxxxxx>

See 95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks,
2021-01-24) for the rabbit hole of why we can't just turn that on. We
need to be able to grep arbitrary binary crap by default.

But for "log" we can be much stricter with encodings, and we've got both
i18n.commitEncoding and i18n.logOutputEncoding in play. This should pay
attention to those.

In theory any such bugs pre-date this series, but in practice they
don't. We'd match in non-UTF8 PCRE mode before, which has some edge
cases, but since we're not highlighting anything not knowing character
boundaries usually doesn't matter.

But with this we've got the same edge case in the log options now:
    
    $ ~/g/git/git log --color --author='.var.*ö.*Bjar' -1 origin/master  | grep ^Author
    Author: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
    $ ~/g/git/git log --color --author='.var.*Bjar' -1 origin/master  | grep ^Author
    grep: (standard input): binary file matches

I think just something like the below should fix this all up. It passes
all tests, although more would need to be added showing how this new
coloring mode handles utf8/non-utf8, and in combination with
e.g. i18n.commitEncoding=latin1 (which will emit "binary", at least
according to my otherwise UTF-8 GNU grep):

If we skip that "!has_non_ascii(p->pattern)' check one test fails, but
maybe that test is just stupid and we should die on that input
anyway. I.e. if we've said we're emitting UTF-8 shouldn't we be
validating/converting that somewhere before we feed it to PCRE and
friends? Maybe there's a good reason not to (and as I type this I
realize performance might suck, although "is ascii?" is fairly cheap,
and you'd only need to do it on !ASCII data).

diff --git a/grep.c b/grep.c
index fe847a0111a..ff09fbdd6cf 100644
--- a/grep.c
+++ b/grep.c
@@ -382,8 +382,9 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
-	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+	if ((opt->utf8_all_the_things && !has_non_ascii(p->pattern) ) ||
+	    (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed))))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/grep.h b/grep.h
index 808ad76f0c5..c9ddd637d18 100644
--- a/grep.h
+++ b/grep.h
@@ -167,6 +167,7 @@ struct grep_opt {
 	int extended_regexp_option;
 	int pattern_type_option;
 	int ignore_locale;
+	int utf8_all_the_things;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/revision.c b/revision.c
index 0dabb5a0bcf..0d751dceb7e 100644
--- a/revision.c
+++ b/revision.c
@@ -2874,6 +2874,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 				 &revs->grep_filter);
 	if (!is_encoding_utf8(get_log_output_encoding()))
 		revs->grep_filter.ignore_locale = 1;
+	else
+		revs->grep_filter.utf8_all_the_things = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
 	if (revs->reverse && revs->reflog_info)




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux