Re: git grep --threads 12 --textconv is effectively single-threaded

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 10 Jul 2020 09:43:15 -0700

Jeff King <peff@xxxxxxxx> writes:

> For a tree-to-tree or index-to-tree comparison, both sides will have an
> oid and can use the textconv cache. Even for an index case where we
> might choose to use a stat-fresh working tree file as an optimization,
> we'll still consult the textconv cache before loading those contents.

OK, that "we'll still consult" part does sound sort-of "clever",
but after reading the whole sentence twice, I realize the actual
cleverness lies in the reuse of the stat-fresh working tree file,
not in the use of the textconv cache ;-)  The cache is just doing
its normal thing: if we know the oid, look up the cached one.

> But for diffing a file in the working tree, we'll never have an oid and
> will always run the textconv command.

OK.  In such a case, we need to run the clean filter on the working
tree contents, and then finally we need to run the textconv on the
result.  We could internally hash the result of applying the clean
filter to see if we have the blob in the object database and use it
as the look-up key in the textconv cache, but we are talking about
the working tree files, which by definition are more fluid than what
is in the index, which is likely more fluid than what is already
committed, so the chance of finding a hit may be slim.
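
To illustrate the "hash the cleaned contents and use that as the
look-up key" idea, here is a rough Python sketch.  It is not git's
code; the cache is modelled as a plain dict, and blob_oid(),
textconv_cached() and run_textconv are names made up for the
illustration.  The only real thing in it is how a SHA-1 blob name is
computed (the hash of a "blob <size>\0" header followed by the
contents).

```python
import hashlib


def blob_oid(data: bytes) -> str:
    """SHA-1 object name git would give a blob holding `data`."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def textconv_cached(cleaned: bytes, cache: dict, run_textconv):
    """Look up already-cleaned working tree contents in a textconv cache.

    `cache` maps blob oid -> converted text (a stand-in for the real
    cache); `run_textconv` is a hypothetical callable that runs the
    external command.  Returns (text, cache_hit).
    """
    oid = blob_oid(cleaned)      # the "from scratch" hashing cost
    if oid in cache:
        return cache[oid], True
    # Miss: we paid for the hash and still have to run the command.
    return run_textconv(cleaned), False
```

Note that the hashing cost is paid on every call, hit or miss, which
is why a low hit rate makes this unattractive.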

We could still see if the oid in the index is correct with the stat
check, but by definition, diff-files won't compare between a cache
entry and a working tree file if the path is stat-clean, so that
does not help all that much.  I wonder if diff-index comparing a
tree and the working tree (i.e. without "--cached") can be improved,
though.

> So "git diff" against the index,
> for example, would run _one_ textconv (using the cached value for the
> index, and running one for the working tree version). And we know that
> isn't that interesting for optimizing, since by definition the file is
> stat-dirty in that case (or else we'd skip the content-level comparison
> entirely).

Yup, we reached the same conclusion here ;-)

> So you'd have to compute the sha1 of the working tree file
> from scratch. Plus the lifetime of a working tree file's entry in the
> textconv cache is probably smaller, since it hasn't even been committed
> yet.

Yes again.

> I don't think I ever noticed because the primary thing I was trying to
> speed up with the textconv cache is "git log -p", etc, which always has
> an oid to work with.

Absolutely.

> But "grep" is a totally different story. It is frequently looking at all
> of the stat-fresh working tree files.

Yeah.  Grepping in the working tree files could be optimized with
the same technique that would be used to optimize diff-index without
"--cached".  When we look at a working tree file, we would consult
the index and, if the path is stat-clean, learn its object name from
there (probably only when textconv is in use, as otherwise there is
nothing to gain).

"git grep" would divert to the "grep in a blob object" codepath from
there, and "git diff-index" would make clever use of the oid_valid
bit in diff_filespec when running textconv (we might need a separate
bit, though; I haven't thought it through).

Or something like that.
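
A very rough Python sketch of the stat-clean shortcut, just to show
the shape of it.  Everything here is hypothetical and much simpler
than the real thing: IndexEntry stands in for a cache entry, the
stat check compares only size and mtime, the textconv cache is a
dict keyed by oid, and run_textconv is a made-up callable for the
external command.

```python
import os
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical, stripped-down stand-in for an index entry."""
    oid: str
    size: int
    mtime_ns: int


def stat_clean(path: str, entry: IndexEntry) -> bool:
    """Simplified stat check: size and mtime match what was recorded."""
    st = os.stat(path)
    return st.st_size == entry.size and st.st_mtime_ns == entry.mtime_ns


def text_for_worktree_path(path, entry, cache, run_textconv):
    """Return the textconv output for a working tree file.

    If the path is stat-clean, the oid recorded in the index is
    trusted and used as the cache key, as if we were looking at the
    blob.  Otherwise the command is run on the file contents and the
    result is not cached.
    """
    if entry is not None and stat_clean(path, entry):
        if entry.oid not in cache:
            with open(path, "rb") as f:
                cache[entry.oid] = run_textconv(f.read())
        return cache[entry.oid]
    with open(path, "rb") as f:
        return run_textconv(f.read())
```

With something like this, a second pass over stat-clean paths would
not run the textconv command at all, while stat-dirty paths behave
as they do today.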