Re: [PATCH v5 5/7] builtin/grep.c: add '--column' option to 'git-grep(1)'

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 17 May 2018 20:38:17 -0700

On Sat, May 12, 2018 at 03:07:04PM +0900, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > I re-read your note and understand more clearly now what your suggestion
> > is. To ensure that we're in agreement, do you mean:
> >
> >   1. '--column -v' will _never_ give a column, but will never die(),
> >       either
>
> No, I don't.
>
> >   2. '--column --[and | or | not]' will never give a column, but will
> >       also never die(), either.
>
> No, I don't.
>
> If a file does not have substring "foo", then
>
> 	git grep -v -e foo file
> 	git grep --not -e foo file
>
> would hit all lines, just like
>
> 	git grep -e '.*' file
>
> does.
>
> I would expect that all of these
>
> 	git grep --column/-o -v -e foo file
> 	git grep --column/-o --not -e foo file
> 	git grep --column/-o -e '.*' file
>
> give the same output, which is what we would get if we consider the
> hit from "choose lines that lack 'foo'" on a line without 'foo' is
> caused by the entire contents on the line.  That is in line with
> "choose lines that has anything (including nothing)" aka ".*" would
> result in the entire line being reported via -o.  The byte offset of
> the first hit on such a line reported by --column is also 1, and
> that is a good and real answer to the question "git grep --column/-o"
> can give.

I agree with your message now and thank you for explaining what you
had written. I spoke with Peff off-list for a while to determine what I
think is essentially the answer to ``what are a set of semantics for
filling out a regmatch_t given an extended expression?''

It's helpful to recognize that the extended expressions are implemented
very much like a tree, so a reasonable semantics will lend itself well
to the way in which match_expr_eval() is implemented. Here's what we
came up with:

  * `git grep -e foo`. This is the case where the extended expression
    has a single atomic node in its tree. This falls into the "just call
    match_one_pattern()" case and has a simple answer: the starting
    offset and ending offset are that of whatever match_one_pattern
    gives.

  * `git grep --not -e foo`. This has the set of semantics that you
    describe above (the starting offset is 1), with the addition that
    the ending offset is the end of the line. This is similar to the
    fact that `--not foo` is very similar to `.$`.

  * `git grep --and -e foo -e bar`. This binary operation should recur
    on its sub-expressions and take the minimum of the starting offset
    and the maximum of the ending offset.

    For inputs of the form "foobar" and "foo bar", it will do the right
    thing (give the starting and ending offset for "foobar" and give no
    match, respectively).

  * `git grep --or -e foo -e bar`. This is the most complicated case, in
    my opinion. In going with the min/max idea in the and case above, I
    think that `--or` should also min/max its sub-expressions, but in
    fact we short-circuit evaluating the second sub-expression when we
    find a match for the first.

    So, in cases like matching `--or -e foo -e bar` with "foo baz bar",
    we'll do the right thing, since `foo` is the first sub-expression
    and happens to be the left-most match. In other words, we __adhere
    to our answer with the left-most match first__ semantics, but only
    because __the first sub-expression is the left-most match__.

    In the other case where we try and match the same expression against
    "bar baz foo", we'll return the starting offset of "foo", even
    though it isn't the left-most match, violating our semantics.

So, I propose we adopt the following: use the trivial answer for "foo",
the whole line for "--not", and min/max the starting/ending offsets for
binary operators, knowing that we will sometimes produce a weird answer
for --or.

I think that the semantics for --or are OK to go forward with, but would
be interested in the thoughts of others to figure out whether this is
sensible to everyone else.

Does this seem like an OK approach? Perhaps Peff can clarify some of
what's shared here, since we did speak elsewhere about it.

Thanks,
Taylor