Re: git grep --threads 12 --textconv is effectively single-threaded

Jeff King <peff@xxxxxxxx> · Thu, 9 Jul 2020 19:10:30 -0400

On Wed, Jul 08, 2020 at 02:06:31PM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > It's probably possible to teach the grep code to do the same
> > check-in-the-index trick, but I'm not sure how complicated it would be.
> 
> I am not sure if we should even depend on the "check the object
> database and use it instead of reading the working tree files" done
> in diff code---somehow I thought we did the opposite for performance
> (i.e. when we ought to be comparing two objects, taken from tree and
> the index, if we notice that the index side is stat clean, we can
> read/mmap the working tree file instead of going to the object layer
> and deflating a loose object, or, worse yet, construct the blob by
> repeatedly applying deltas on a base object in a packfile).
> 
> Is this one in the opposite direction done specifically for gaining
> performance when textconv cache is in use?  If so, kudos to whoever
> did it---that sounds like a clever thing to do.

No, it turns out that nobody was that clever (and I was simply
misremembering how it worked).

For a tree-to-tree or index-to-tree comparison, both sides will have an
oid and can use the textconv cache. Even for an index case where we
might choose to use a stat-fresh working tree file as an optimization,
we'll still consult the textconv cache before loading those contents.

But for diffing a file in the working tree, we'll never have an oid (and
will always run the textconv command). So "git diff" against the index,
for example, would run _one_ textconv (using the cached value for the
index, and running one for the working tree version). And we know that
isn't that interesting for optimizing, since by definition the file is
stat-dirty in that case (or else we'd skip the content-level comparison
entirely). So you'd have to compute the sha1 of the working tree file
from scratch. Plus the lifetime of a working tree file's entry in the
textconv cache is probably smaller, since it hasn't even been committed
yet.
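
To make the cost concrete, here is a rough Python sketch (not git's
actual C code) of an oid-keyed textconv lookup. The names `textconv`,
`cache`, and `convert` are hypothetical, and the dict merely stands in
for the real cache. The only part taken directly from git is how a blob
oid is derived: sha1 over a "blob <size>\0" header plus the contents.

```python
import hashlib
from typing import Callable, Dict, Optional


def blob_oid(content: bytes) -> str:
    """Compute the sha1 blob oid git would assign to these contents."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def textconv(
    content: bytes,
    oid: Optional[str],
    cache: Dict[str, bytes],
    convert: Callable[[bytes], bytes],
) -> bytes:
    """Hypothetical lookup: use the cache only when an oid is known."""
    if oid is None:
        # Working tree file with no oid: nothing to key the cache on,
        # so the (possibly expensive) converter runs every time.
        return convert(content)
    if oid not in cache:
        cache[oid] = convert(content)
    return cache[oid]
```

A tree or index entry arrives with its oid already known, so the lookup
is cheap. For a stat-dirty working tree file, the caller would first
have to call something like blob_oid() over the whole file just to find
out whether the cache has an entry.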

I don't think I ever noticed because the primary thing I was trying to
speed up with the textconv cache is "git log -p", etc, which always has
an oid to work with.

But "grep" is a totally different story. It is frequently looking at all
of the stat-fresh working tree files.

-Peff