Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 3 Sep 2020 14:49:20 -0400

On Tue, Sep 01, 2020 at 04:36:50PM +0200, SZEDER Gábor wrote:
> Something seems to be wrong in this patch, though I haven't looked
> closer.  Consider this test with a bit of makeshift tracing:
>
>   ---  >8  ---
>
> diff --git a/bloom.c b/bloom.c
> index 8d07209c6b..1a0dec35cd 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -222,6 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  	if (!compute_if_not_present)
>  		return NULL;
>
> +	printf("get_or_compute_bloom_filter(): diff: %s\n", oid_to_hex(&c->object.oid));
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.detect_rename = 0;
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 0000000000..0833e6ff7e
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,25 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'test' '
> +	for i in 1 2 3 4 5 6
> +	do
> +		git commit -q --allow-empty -m $i || return 1
> +	done &&
> +	git log --oneline &&
> +
> +	# We have 6 commits and compute 2 Bloom filters per execution,
> +	# so 3 executions should be enough...  but, alas, it isnt.
> +	for i in 1 2 3 4 5
> +	do
> +		# No --split=replace!
> +		git commit-graph write --reachable --changed-paths --max-new-filters=2 || return 1
> +	done &&
> +
> +	git commit-graph write --reachable --changed-paths --max-new-filters=4
> +'
> +
> +test_done
>
>   ---  >8  ---
>
> It's output looks like:
>
>   [...]
>   + git log --oneline
>   13fcefa (HEAD -> master) 6
>   0c71516 5
>   a82a61c 4
>   54c6b2c 3
>   fc99def 2
>   a3a8cd3 1
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   get_or_compute_bloom_filter(): diff: a82a61c79b2b07c4440e292613e11a69e33ef7a2
>   get_or_compute_bloom_filter(): diff: fc99def8b1df27bcab7d1f4b7ced73239f9bd7ec
>
> See how the third write with '--max-new-filters=2' computes the
> filters that have already been computed by the first write instead of
> those two that have never been computed?  And then how the fourth
> write computes filters that have already been computed by the second
> write?  And ultimately we'll need a write without '--max-new-filters' (or
> with '--max-new-filters=<large-enough>') to compute all remaining
> filters.

Ouch. This definitely has to do with the empty commits, since swapping
out your 'git commit ... --allow-empty' for a 'test_commit' produces the
output that you'd expect.

> With '--split=replace' it appears to work as expected.

This is definitely the critical bit. The crux of the issue is that
'copy_oids_to_commits()' handles split and non-split graphs differently.
The critical bits here are:

  * 43d3561805 (commit-graph write: don't die if the existing graph is
    corrupt, 2019-03-25) which forces the relevant data to *not* be
    loaded from an existing commit-graph, and

  * 8a6ac287b2 (builtin/commit-graph.c: introduce split strategy
    'replace', 2020-04-13), which does load data from an existing
    commit-graph with '--split=replace'.

When writing a graph with '--split=replace', commits are loaded from the
graph, which includes setting their '->graph_pos' (or rather setting
this data in a commit slab, which is I guess how it's done these days).
Without '--split=replace', the graph position will never be set.

So, by the time we get to 'get_bloom_filter_large_in_graph', the graph
position is 'COMMIT_NOT_FROM_GRAPH', which in turn forces us to
recompute the filter from scratch, since we assume that being
'NOT_FROM_GRAPH' implies that we won't find it in any 'BFXL' chunk.

Regardless of whether or not we should be trusting the parentage
information on-disk, recomputing the Bloom filters from scratch is
simply too expensive (and the opposite of the point of this series). So,
doing the following to force 'get_bloom_filter_large_in_graph' to lookup
the Bloom and BFXL data in a commit graph by forcibly loading its graph
position is the right thing to do.

This is sufficient to get us unstuck:

diff --git a/commit-graph.c b/commit-graph.c
index bec4e5b725..243c7253ff 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -956,8 +956,8 @@ int get_bloom_filter_large_in_graph(struct commit_graph *g,
                                    const struct commit *c,
                                    uint32_t max_changed_paths)
 {
-       uint32_t graph_pos = commit_graph_position(c);
-       if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+       uint32_t graph_pos;
+       if (!find_commit_in_graph(c, g, &graph_pos))
                return 0;

        while (g && graph_pos < g->num_commits_in_base)

...but adding something like:

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a7a3b41919..571676cef2 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -310,4 +310,29 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
        )
 '

+test_expect_success 'Bloom generation backfills empty commits' '
+       git init empty &&
+       test_when_finished "rm -fr empty" &&
+       (
+               cd empty &&
+               for i in $(test_seq 1 6)
+               do
+                       git commit --allow-empty -m "$i"
+               done &&
+
+               # Generate Bloom filters for empty commits 1-6, two at a time.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       0 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       2 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       4 2 2 &&
+
+               # Finally, make sure that once all commits have filters, that
+               # none are subsequently recomputed.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       6 0 0
+       )
+'
+
 test_done

is a good idea to harden what you wrote in your t9999 into an actual
test to prevent against regression. I'll fold both of those into this
patch.

Thanks for the bug report. It led to an interesting investigation as a
result :).

Thanks,
Taylor