On 10/13/2017 10:20 AM, Jeff King wrote:
On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:
Hmm. So this patch makes it go fast:
diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
tree_difference |= diff;
- if (tree_difference == REV_TREE_DIFFERENT)
+ if (tree_difference & REV_TREE_DIFFERENT)
DIFF_OPT_SET(options, HAS_CHANGES);
}
But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).
I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).
Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).
Answering my own question a little, there is a hint in the comment
a few lines above:
/*
* The goal is to get REV_TREE_NEW as the result only if the
* diff consists of all '+' (and no other changes), REV_TREE_OLD
* if the whole diff is removal of old data, and otherwise
* REV_TREE_DIFFERENT (of course if the trees are the same we
* want REV_TREE_SAME).
* That means that once we get to REV_TREE_DIFFERENT, we do not
* have to look any further.
*/
So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the different between NEW, OLD, and DIFFERENT.
Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():
case REV_TREE_NEW:
if (revs->remove_empty_trees &&
rev_same_tree_as_empty(revs, p)) {
/* We are adding all the specified
* paths from this parent, so the
* history beyond this parent is not
* interesting. Remove its parents
* (they are grandparents for us).
* IOW, we pretend this parent is a
* "root" commit.
*/
if (parse_commit(p) < 0)
die("cannot simplify commit %s (invalid %s)",
oid_to_hex(&commit->object.oid),
oid_to_hex(&p->object.oid));
p->parents = NULL;
}
/* fallthrough */
case REV_TREE_OLD:
case REV_TREE_DIFFERENT:
So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.
-Peff
This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case of
the bug it seems it will not be a huge performance improvement in most
cases. Still worth fixing and I'm looking at your suggestions to try and
learn this area better.
It will speed up natural cases of someone adding or renaming a folder
with a lot of contents, including someone initializing a repository from
an existing codebase. This git bomb case happens to add all of the
folders at once, which is why the performance is so noticeable. I use a
version of this "exponential file growth" for testing that increases the
folder-depth one commit at a time, so the contents are rather small in
the first "add", hence not noticing this issue.
Thanks,
-Stolee