Re: [PATCH 1/5] grep: stop modifying buffer in strip_timestamp

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 23 Sep 2021 02:53:15 +0200

On Wed, Sep 22 2021, Jeff King wrote:

> On Tue, Sep 21, 2021 at 11:02:31PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> 
>> On Tue, Sep 21 2021, Jeff King wrote:
>> 
>> > On Tue, Sep 21, 2021 at 09:37:23AM +0200, René Scharfe wrote:
>> >
>> >> > @@ -965,9 +953,12 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
>> >> >  		bol += len;
>> >> >  		switch (p->field) {
>> >> >  		case GREP_HEADER_AUTHOR:
>> >> > -		case GREP_HEADER_COMMITTER:
>> >> > -			strip_timestamp(bol, &eol);
>> >> > +		case GREP_HEADER_COMMITTER: {
>> >> > +			char *em = memrchr(bol, '>', eol - bol);
>> >> > +			if (em)
>> >> > +				eol = em + 1;
>> >> 
>> >> The old code documents the intent via the function name.  The new one
>> >> goes into the nitty-gritty without further explanation, which I find
>> >> harder to read.
>> >
>> > Agreed. I do think the conversion is functionally correct, but it
>> > doesn't strike me as worth the change.
>> 
>> As far as some general improvement in thish area it seems to me that
>> this whole subthread is losing the forest for the trees.
>
> I'd definitely agree with that. :)
>
>> It probably makes sense to split up that commit.c code into something
>> that can give you structured output, i.e. headers with types and
>> start/end points for interesting data, then in grep.c we won't need a
>> strip_anything(), or strrchr() or memrchr() or whatever.
>
> I don't disagree with any of this, either, but I think it's a separate
> (and much more complicated) topic than what this series is dealing with.
> So my preference would be to take this as an immediate improvement, and
> let anything like that get built on top.
>
>> It would also be a lot faster for grepping if we could offload more of
>> this work to the regex engine, particularly if we've got a more capable
>> engine like PCREv2.
>> 
>> In many cases we're splitting lines ourselves, when we could have the
>> engine work in a multi-line mode, or translate the user's --author match
>> into something that can match the raw commit header. So have an implicit
>> /^author /m anchor if we're matching author headers, instead of having
>> grep.c string-twiddle that around one line at a time.
>
> Ditto here. Multi-line matching may make things a lot more efficient,
> but I think is out of scope for this series. That seems like an
> interesting area for future work.

Indeed, *way* outside this series. I'm I think the change you suggested
here makes sense for this change, just pointing out that for any
follow-up changes it's much more worthwhile to consider a few more
stackframes than just the 1-2 that involve those strings in that
particular form.