Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp

Jeff King <peff@xxxxxxxx> · Wed, 22 Sep 2021 16:20:53 -0400

On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Sep 21 2021, Jeff King wrote:
> 
> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
> >
> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
> >> >  		bol += len;
> >> >  		switch (p->field) {
> >> >  		case GREP_HEADER_AUTHOR:
> >> > -		case GREP_HEADER_COMMITTER:
> >> > -			strip_timestamp(bol, &eol);
> >> > +		case GREP_HEADER_COMMITTER: {
> >> > +			char *em = memrchr(bol, '>', eol - bol);
> >> > +			if (em)
> >> > +				eol = em + 1;
> >> 
> >> The old code documents the intent via the function name.  The new one
> >> goes into the nitty-gritty without further explanation, which I find
> >> harder to read.
> >
> > Agreed. I do think the conversion is functionally correct, but it
> > doesn't strike me as worth the change.
> 
> As far as some general improvement in thish area it seems to me that
> this whole subthread is losing the forest for the trees.

I'd definitely agree with that. :)

> It probably makes sense to split up that commit.c code into something
> that can give you structured output, i.e. headers with types and
> start/end points for interesting data, then in grep.c we won't need a
> strip_anything(), or strrchr() or memrchr() or whatever.

I don't disagree with any of this, either, but I think it's a separate
(and much more complicated) topic than what this series is dealing with.
So my preference would be to take this as an immediate improvement, and
let anything like that get built on top.

> It would also be a lot faster for grepping if we could offload more of
> this work to the regex engine, particularly if we've got a more capable
> engine like PCREv2.
> 
> In many cases we're splitting lines ourselves, when we could have the
> engine work in a multi-line mode, or translate the user's --author match
> into something that can match the raw commit header. So have an implicit
> /^author /m anchor if we're matching author headers, instead of having
> grep.c string-twiddle that around one line at a time.

Ditto here. Multi-line matching may make things a lot more efficient,
but I think is out of scope for this series. That seems like an
interesting area for future work.

-Peff