Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 21 Sep 2021 23:02:31 +0200

On Tue, Sep 21 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>
>> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >  		bol += len;
>> >  		switch (p->field) {
>> >  		case GREP_HEADER_AUTHOR:
>> > -		case GREP_HEADER_COMMITTER:
>> > -			strip_timestamp(bol, &eol);
>> > +		case GREP_HEADER_COMMITTER: {
>> > +			char *em = memrchr(bol, '>', eol - bol);
>> > +			if (em)
>> > +				eol = em + 1;
>> 
>> The old code documents the intent via the function name.  The new one
>> goes into the nitty-gritty without further explanation, which I find
>> harder to read.
>
> Agreed. I do think the conversion is functionally correct, but it
> doesn't strike me as worth the change.

As far as some general improvement in thish area it seems to me that
this whole subthread is losing the forest for the trees.

We're operating a buffer that has "\n"'s in it, and then we loop and
split it up one line at a time, often just to match author OR
committer.

We then have things like commit_rewrite_person() in revision.c that's
preparing a "fake" commit just for grep.c's use, just for it to strip
those headers down again.

A real improvement in this area IMO would be trying to bridge the gap
between parse_commit_buffer() and grep.c's match_one_pattern().

I.e. we've even parsed this once before in commit.c's
parse_commit_date() by the time it gets to grep.c, now we're just doing
it again.

It probably makes sense to split up that commit.c code into something
that can give you structured output, i.e. headers with types and
start/end points for interesting data, then in grep.c we won't need a
strip_anything(), or strrchr() or memrchr() or whatever.

It would also be a lot faster for grepping if we could offload more of
this work to the regex engine, particularly if we've got a more capable
engine like PCREv2.

In many cases we're splitting lines ourselves, when we could have the
engine work in a multi-line mode, or translate the user's --author match
into something that can match the raw commit header. So have an implicit
/^author /m anchor if we're matching author headers, instead of having
grep.c string-twiddle that around one line at a time.

Just my 0.02.