On Mon, Oct 02, 2023 at 03:55:46PM -0700, Jonathan Tan wrote: > Taylor Blau <me@xxxxxxxxxxxx> writes: > > diff --git a/revision.c b/revision.c > > index 2f4c53ea20..1d36df49e2 100644 > > --- a/revision.c > > +++ b/revision.c > > @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs, > > static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) > > { > > struct tree *t1 = repo_get_commit_tree(the_repository, commit); > > + int bloom_ret = 1; > > > > if (!t1) > > return 0; > > > > + if (revs->bloom_keys_nr) { > > + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > + if (!bloom_ret) > > + return 1; > > + } > > + > > tree_difference = REV_TREE_SAME; > > revs->pruning.flags.has_changes = 0; > > diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning); > > > > + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) > > + count_bloom_filter_false_positive++; > > + > > return tree_difference == REV_TREE_SAME; > > } > > I'll concentrate on getting this patch in, and will look at (and > discuss) the other Bloom filter-related emails later. Sounds good. I know that I have some pending mail from SZEDER as well, so I'll try and apply any feedback from both of you before sending a reroll so that we can get this all done in one shot. > This looks good, possibly except a code path in try_to_simplify_commit() > that calls this rev_same_tree_as_empty() function when > rev_compare_tree() between a commit and its parent returns REV_TREE_NEW. > So there are 2 issues: How can rev_compare_tree() ever return > REV_TREE_NEW? And it doesn't seem right to check Bloom filters in this > code path, since rev_same_tree_as_empty() was invoked here while we are > enumerating through a commit's parents, which necessarily implies that > the commit has parents, but here we're using the Bloom filter as if the > commit is known to have no parents. > As for the first issue, rev_compare_tree() returns REV_TREE_NEW when the > parent's tree is NULL. I'm not sure how this can happen - the tree can > be NULL if the parent commit is not parsed, but at this point I think > that it has been parsed. And I think every commit has a tree. This goes > back all the way to 3a5e860815 (revision: make tree comparison functions > take commits rather than trees, 2008-11-03) and even beyond that (I > didn't dig further). The more I think about this, the more confused I become ;-). I was able to track this all the way back to Linus's 461cf59f89 (rev-list: stop when the file disappears, 2006-01-18), but am convinced that both of these cases (we have an analogous one for when t2 is NULL and we return REV_TREE_OLD) are dead code. I replaced the "if (!t1) return REV_TREE_NEW;" with "if (!t1) BUG("oops")", and was able to get the whole test suite to pass. So... I am pretty sure that this is dead code, but not sure enough to remove it myself ;-). To your point about seeing the tree as NULL before parsing the parent, I don't think that is the case here, since we parse the parent immediately before calling rev_compare_tree() (and indeed Git will refuse to read commit objects which do not list a tree). > As for the second issue, we can probably solve this by being defensive > in rev_same_tree_as_empty() by only using the Bloom filter when the > commit has no parents. Not sure if this is being overly defensive, > though. I am also unsure whether we are being overly defensive here or not. But I agree that it does feel safer to apply something like: --- 8< --- diff --git a/revision.c b/revision.c index 3d78ea6a9a..21b3085465 100644 --- a/revision.c +++ b/revision.c @@ -834,7 +834,8 @@ static int rev_compare_tree(struct rev_info *revs, return tree_difference; } -static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) +static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit, + int use_bloom_filter) { struct tree *t1 = repo_get_commit_tree(the_repository, commit); int bloom_ret = 1; @@ -842,7 +843,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) if (!t1) return 0; - if (revs->bloom_keys_nr) { + if (use_bloom_filter && revs->bloom_keys_nr) { bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); if (!bloom_ret) return 1; @@ -892,7 +893,7 @@ static int compact_treesame(struct rev_info *revs, struct commit *commit, unsign if (nth_parent != 0) die("compact_treesame %u", nth_parent); old_same = !!(commit->object.flags & TREESAME); - if (rev_same_tree_as_empty(revs, commit)) + if (rev_same_tree_as_empty(revs, commit, 1)) commit->object.flags |= TREESAME; else commit->object.flags &= ~TREESAME; @@ -988,7 +989,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) return; if (!commit->parents) { - if (rev_same_tree_as_empty(revs, commit)) + if (rev_same_tree_as_empty(revs, commit, 1)) commit->object.flags |= TREESAME; return; } @@ -1069,7 +1070,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) case REV_TREE_NEW: if (revs->remove_empty_trees && - rev_same_tree_as_empty(revs, p)) { + rev_same_tree_as_empty(revs, p, 0)) { /* We are adding all the specified * paths from this parent, so the * history beyond this parent is not --- >8 --- on top. > There is also the issue that count_bloom_filter_false_positive is > incremented even when no Bloom filters are present, but I think this is > fine (it matches the behavior of rev_compare_tree()). Yep, agreed. Thanks, Taylor