Re: [PATCH v1] convert: add test to demonstrate clean invocations

Jeff King <peff@xxxxxxxx> · Thu, 21 Jul 2016 17:37:40 -0400

On Thu, Jul 21, 2016 at 10:59:07PM +0200, larsxschneider@xxxxxxxxx wrote:

> From: Lars Schneider <larsxschneider@xxxxxxxxx>
> 
> If three files processed by a filter are added and committed to a
> repository then I expect three clean invocations per Git operation.
> Apparently Git invokes the clean process 12 times.
> 
> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
> ---
> 
> It's pretty late here and I hope do not miss something obvious... but
> can anyone explain to me why the clean operation is executed 12 times
> for 3 files?

Hmm. If I run this under gdb and backtrace when we hit apply_filter(), I
see:

> +		rm -f run.count &&
> +		git add . &&
> +		test_line_count = 3 run.count

This "git add" does invoke apply_filter() 3 times, which I'd expect.

> +		rm -f run.count &&
> +		git commit . -m "test commit" &&
> +		test_line_count = 3 run.count

This invokes apply_filter() six times. So something funny is going on
already (I do get 12 dots, so checking apply_filter() seems to only
cover half the invocations).

Three of them are for re-adding the three files to the index again,
since "git commit ." is asking us to do so. I'm surprised, though; I
would have thought we could avoid doing so just based on the stat
information. Maybe it's a racy-index thing because the files' mtimes are
likely to be the same as the index?

Indeed, if I stick a "sleep 1" between modifying the files and calling
"git add", then the "git commit" invocation drops to only 6 invocations
of the filter. So that explains half of it (though I'm still confused
why index refreshing requires 6 and not 3; I guess it may be because
"git commit ." works in a separate index, and we first refresh that
index, and then refresh the "real" index again afterwards, when we could
theoretically just copy the entries).

The next three are to show the final diffstat after the commit
completes. It looks like the "should we reuse the worktree file?"
optimization in diff_populate_filespec() does not take into account
whether we will have to convert the contents, and it probably should.
The point is that sometimes mmap-ing the file contents is more efficient
than zlib inflating the object from disk. But if we have to exec an
extra process and read the whole object contents into a strbuf, that is
probably not a win.

Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to
3 (but obviously it should actually check the attributes).

So of the 12 invocations:

  - 3 must be for refreshing the index again, because the way the test
    is written causes us to err on the side of caution when the mtimes
    are the same (and also means that even if your test is fixed to
    pass, it would racily fail when the system is under load)

  - 3 are for the double-refresh when "git commit ---only" is used (and
    "git commit ." implies "--only". Seems like there is room for
    optimization there.

  - 3 are for the tree-diff reusing the worktree files. This should be
    dropped.

  - The other 3, I'm not sure.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html