On Thu, Jul 21, 2016 at 10:59:07PM +0200, larsxschneider@xxxxxxxxx wrote: > From: Lars Schneider <larsxschneider@xxxxxxxxx> > > If three files processed by a filter are added and committed to a > repository then I expect three clean invocations per Git operation. > Apparently Git invokes the clean process 12 times. > > Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx> > --- > > It's pretty late here and I hope do not miss something obvious... but > can anyone explain to me why the clean operation is executed 12 times > for 3 files? Hmm. If I run this under gdb and backtrace when we hit apply_filter(), I see: > + rm -f run.count && > + git add . && > + test_line_count = 3 run.count This "git add" does invoke apply_filter() 3 times, which I'd expect. > + rm -f run.count && > + git commit . -m "test commit" && > + test_line_count = 3 run.count This invokes apply_filter() six times. So something funny is going on already (I do get 12 dots, so checking apply_filter() seems to only cover half the invocations). Three of them are for re-adding the three files to the index again, since "git commit ." is asking us to do so. I'm surprised, though; I would have thought we could avoid doing so just based on the stat information. Maybe it's a racy-index thing because the files' mtimes are likely to be the same as the index? Indeed, if I stick a "sleep 1" between modifying the files and calling "git add", then the "git commit" invocation drops to only 6 invocations of the filter. So that explains half of it (though I'm still confused why index refreshing requires 6 and not 3; I guess it may be because "git commit ." works in a separate index, and we first refresh that index, and then refresh the "real" index again afterwards, when we could theoretically just copy the entries). The next three are to show the final diffstat after the commit completes. It looks like the "should we reuse the worktree file?" optimization in diff_populate_filespec() does not take into account whether we will have to convert the contents, and it probably should. The point is that sometimes mmap-ing the file contents is more efficient than zlib inflating the object from disk. But if we have to exec an extra process and read the whole object contents into a strbuf, that is probably not a win. Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to 3 (but obviously it should actually check the attributes). So of the 12 invocations: - 3 must be for refreshing the index again, because the way the test is written causes us to err on the side of caution when the mtimes are the same (and also means that even if your test is fixed to pass, it would racily fail when the system is under load) - 3 are for the double-refresh when "git commit ---only" is used (and "git commit ." implies "--only". Seems like there is room for optimization there. - 3 are for the tree-diff reusing the worktree files. This should be dropped. - The other 3, I'm not sure. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html