Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion

Jeff King <peff@xxxxxxxx> · Fri, 22 Jul 2016 14:10:24 -0400

On Fri, Jul 22, 2016 at 10:44:08AM -0700, Junio C Hamano wrote:

> > diff --git a/diff.c b/diff.c
> > index 7d03419..b43d3dd 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
> >  	if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1))
> >  		return 0;
> >  
> > +	/*
> > +	 * Similarly, if we'd have to convert the file contents anyway, that
> > +	 * makes the optimization not worthwhile.
> > +	 */
> > +	if (!want_file && would_convert_to_git(name))
> > +		return 0;
> 
> The would_convert_to_git() function is not a free operation.  It
> needs to prime the attribute stack, so it needs to open/read/parse a
> few files ($GIT_DIR/info/attributes and .gitattributes at least, and
> more if your directory hierarchy is deep) on the filesystem.  The
> cost is amortized across paths, but we do not even enable the
> optimization if we have to pay the cost of reading the index
> ourselves.

Yeah, I almost commented on that, and its position in the function, but
forgot to.

The only code path which will trigger this is diff_populate_filespec.
After reuse_worktree_file() says "yes, we can reuse", we drop into a
conditional that will end in us calling convert_to_git() anyway, which
will do the same lookup. We don't need to cache, because the expensive
parts of the attribute-lookup are already cached for us by the attribute
code.

So my initial thought was to put it at the end of reuse_worktree_file(),
where it would have the least impact. If we find we cannot reuse the
file, then we would skip both this new attr lookup and the one in
diff_populate_filespec().

But in practice, I think we'll already have cached those attrs before we
even hit this function, because we'll hit the userdiff_find_by_path()
code earlier in the diff process (e.g., to see if it's binary, if we
need to textconv, etc). Those look for different attributes, but I think
the expensive bits (finding, opening, reading attribute files) are
cached across all lookups.

So I think we actually _can_ think of would_convert_to_git() as
basically free. Or as free as other efficient-lookup functions we call
like cache_name_pos(). And so I moved it further up in the function,
where it lets us avoid doing more out-of-process work (like calling
lstat() so we can ce_match_stat() on the result).

Possibly it should go after the cache_name_pos() call, though. That's
likely to be less expensive than the actual walk of the attr tree.

> I suspect that we may be better off disabling this optimization if
> we need to always call the helper.

The thought "does this tree reuse even speed things up enough to justify
its complexity" definitely crossed my mind. It's basically swapping
open/mmap for zlib inflating the content.

But I do think it helps in the "want_file" case (i.e., when we are
writing out a tempfile for an external command via prepare_temp_file()).
There it helps us omit writing a tempfile to disk entirely, including
any possible conversion.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html