Re: [PATCH 0/7] grep.c: teach --column to 'git-grep(1)'

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 19 Jun 2018 12:44:52 -0500

On Tue, Jun 19, 2018 at 07:33:39PM +0200, René Scharfe wrote:
> Am 19.06.2018 um 18:35 schrieb Jeff King:
> > On Mon, Jun 18, 2018 at 06:43:01PM -0500, Taylor Blau wrote:
> >> The notable case that it does _not_ cover is matching the following
> >> line:
> >>
> >>    a ... b
> >>
> >> with the following expression
> >>
> >>    git grep --column -e b --or -e a
> >>
> >> This will produce the column for 'b' rather than the column for 'a',
> >> since we short-circuit an --or when the left child finds a match, in
> >> this case 'b'. So, we break the semantics for this case, at the benefit
> >> of not having to do twice the work.
> >>
> >> In the future, I'd like to revisit this, since any performance gains
> >> that we _do_ make in this area are moot when we rescan all lines in
> >> show_line() with --color. A path forward, I imagine, would look like a
> >> list of regmatch_t's, or a set of locations in the expression tree, such
> >> that we could either enumerate the list or walk the tree in order to
> >> colorize the line. But, I think for now that is #leftoverbits.
> >
> > The key thing about this iteration is that it doesn't regress
> > performance, because we always short-circuit where we used to. The other
> > obvious route is to stop short-circuiting only when "--column" is in
> > effect, which would have the same property (at the expense of a little
> > extra code in match_expr_eval()).
>
> The performance impact of the exhaustive search for --color scales with
> the number of shown lines, while it would scale with the total number of
> lines for --column.  Coloring the results of highly selective patterns
> is relatively cheap, short-circuiting them still helps significantly.

Sure. I was pointing out that we are repeating work in cases where it is
unnecessary to do so. It seems natural to me to take one of the two
above approaches, and optimize where we can.

> Disabling that optimization for --column wouldn't be a regression since
> it's a new option..  Picking a random result (based on the order of
> evaluation) seems sloppy and is probably going to surprise users.

That's fair, and something I'm happy to do if others feel OK about it.
Here is a patch to that effect:

diff --git a/grep.c b/grep.c
index f3329d82ed..a09935d8c5 100644
--- a/grep.c
+++ b/grep.c
@@ -1257,8 +1257,8 @@ static int match_one_pattern(struct grep_pat *p, char *bol, char *eol,
 	return hit;
 }

-static int match_expr_eval(struct grep_expr *x, char *bol, char *eol,
-			   enum grep_context ctx, ssize_t *col,
+static int match_expr_eval(struct grep_opt *opt, struct grep_expr *x, char *bol,
+			   char *eol, enum grep_context ctx, ssize_t *col,
 			   ssize_t *icol, int collect_hits)
 {
 	int h = 0;
@@ -1282,26 +1282,27 @@ static int match_expr_eval(struct grep_expr *x, char *bol, char *eol,
 		/*
 		 * Upon visiting a GREP_NODE_NOT, col and icol become swapped.
 		 */
-		h = !match_expr_eval(x->u.unary, bol, eol, ctx, icol, col, 0);
+		h = !match_expr_eval(opt, x->u.unary, bol, eol, ctx, icol, col, 0);
 		break;
 	case GREP_NODE_AND:
-		if (!match_expr_eval(x->u.binary.left, bol, eol, ctx, col,
+		if (!match_expr_eval(opt, x->u.binary.left, bol, eol, ctx, col,
 				     icol, 0))
 			return 0;
-		h = match_expr_eval(x->u.binary.right, bol, eol, ctx, col,
+		h = match_expr_eval(opt, x->u.binary.right, bol, eol, ctx, col,
 				    icol, 0);
 		break;
 	case GREP_NODE_OR:
-		if (!collect_hits)
-			return (match_expr_eval(x->u.binary.left, bol, eol, ctx,
+		if (!(collect_hits || opt->columnnum))
+			return (match_expr_eval(opt, x->u.binary.left, bol, eol, ctx,
 						col, icol, 0) ||
-				match_expr_eval(x->u.binary.right, bol, eol,
+				match_expr_eval(opt, x->u.binary.right, bol, eol,
 						ctx, col, icol, 0));
-		h = match_expr_eval(x->u.binary.left, bol, eol, ctx, col,
+		h = match_expr_eval(opt, x->u.binary.left, bol, eol, ctx, col,
 				    icol, 0);
-		x->u.binary.left->hit |= h;
-		h |= match_expr_eval(x->u.binary.right, bol, eol, ctx, col,
-				     icol, 1);
+		if (collect_hits)
+			x->u.binary.left->hit |= h;
+		h |= match_expr_eval(opt, x->u.binary.right, bol, eol, ctx, col,
+				     icol, collect_hits);
 		break;
 	default:
 		die("Unexpected node type (internal error) %d", x->node);
@@ -1316,7 +1317,7 @@ static int match_expr(struct grep_opt *opt, char *bol, char *eol,
 		      ssize_t *icol, int collect_hits)
 {
 	struct grep_expr *x = opt->pattern_expression;
-	return match_expr_eval(x, bol, eol, ctx, col, icol, collect_hits);
+	return match_expr_eval(opt, x, bol, eol, ctx, col, icol, collect_hits);
 }

 static int match_line(struct grep_opt *opt, char *bol, char *eol,

> We could add an optimizer pass to reduce the number of regular
> expressions in certain cases if that is really too slow.  E.g. this:
>
> 	$ git grep -e b -e a
>
> ... is equivalent to:
>
> 	$ git grep -e '\(b\)\|\(a\)'
>
> In that example the optimizer should use a single kwset instead of a
> regex, but you get the idea, namely to leave the short-circuiting to the
> regex engine or kwset, which probably do a good job of it.

I think that--while this pushes that decision to the appropriate level
of indirection--that it is out of scope for this series, and that the
above patch should do a sufficient job at not surprising users.

> René

Thanks,
Taylor