On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote: > As pointed out in my original bug report [1], modified path Bloom > filters are computed with hardcoded settings in > bloom.c:get_bloom_filter(). Since this patch does not touch bloom.c > at all, it still computes Bloom filters with those hardcoded settings, > and, consequently, despite the commit message's claims, it does not > persist the settings in the existing commit-graph. > > [1] https://public-inbox.org/git/20200619140230.GB22200@xxxxxxxxxx/ Right. It is worth mentioning here (as you do below) that this was fixed as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings', 2020-09-16). > > Use the trace2 API to signal the settings used during the write, and > > check that output in a test after manually adjusting the correct bytes > > in the commit-graph file. > > This test is insufficient, as it only checks what settings trace2 > believes the Bloom filters are computed with, not what settings they > are actually computed with; that's why it succeeded while the bug > whose absence it was supposed to ensure was still there. > > More robust tests should instead look at what actually gets written to > the commit-graph, and how that is interpreted during pathspec-limited > revision walks. Thanks for the test! That saved me a little bit of work trying to set up the scenario you're describing in a reproducible way. I think that the third test can be fixed relatively easily. The crux of the issue there is that we are computing Bloom filters for commits, some of which already had Bloom filters computed in an earlier layer, but with different settings than the one that we're using to write the current layer. So, we need to be more aggressively checking the Bloom filter settings in any layer we want to reuse a Bloom filter out of before reusing it verbatim in the current layer. The patch below the scissors line is sufficient to do that, and it causes the third test to start passing. ...But, teaching the revision machinery how to handle a walk through commits in multiple commit-graph layers with incompatible Bloom filter settings is trickier. Right now we compute all of the Bloom keys up front using the Bloom filter settings in the top layer. I think we'd probably want to change this to lazily compute those keys by: - Keeping around the contents of the Bloom keys, i.e., the pathspecs themselves. - Keeping a hashmap from Bloom filter settings to computed Bloom keys corresponding to those settings. - Lazily filling in that hashmap as we traverse through different commits. At least, that's the best way that I can think to do it. It does get kind of slow, though; we'd have to scan every commit graph layer at each commit in the worst case to find the actual 'struct commit_graph *' in order to get its Bloom filter settings. So, I think that's sort of show-stoppingly slow, and that we should probably think more deeply about it before taking up that direction. Maybe Stolee has some better thoughts about how we can quickly go from a commit to either (a) the commit-graph struct that that commit is stored in, or (b) the Bloom filter settings belonging to that struct. Thanks, Taylor --- >8 --- Subject: [PATCH] bloom: recompute filters with incompatible settings --- bloom.c | 21 +++++++++++++++++++-- commit-graph.c | 4 ++-- 2 files changed, 21 insertions(+), 4 deletions(-) diff --git a/bloom.c b/bloom.c index 68c73200a5..30da128474 100644 --- a/bloom.c +++ b/bloom.c @@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos) static int load_bloom_filter_from_graph(struct commit_graph *g, struct bloom_filter *filter, - struct commit *c) + struct commit *c, + const struct bloom_filter_settings *settings) { uint32_t lex_pos, start_index, end_index; uint32_t graph_pos = commit_graph_position(c); @@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g, if (!g->chunk_bloom_indexes) return 0; + if (settings) { + struct bloom_filter_settings *graph_settings = g->bloom_filter_settings; + /* + * Check that the settings used to generate new Bloom filters is + * compatible with the settings Bloom filters in this + * commit-graph layer were generated with. + */ + if (settings->hash_version != graph_settings->hash_version) + return 0; + if (settings->num_hashes != graph_settings->num_hashes) + return 0; + if (settings->bits_per_entry != graph_settings->bits_per_entry) + return 0; + } + lex_pos = graph_pos - g->num_commits_in_base; end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos); @@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, if (!filter->data) { load_commit_graph_info(r, c); if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH) - load_bloom_filter_from_graph(r->objects->commit_graph, filter, c); + load_bloom_filter_from_graph(r->objects->commit_graph, + filter, c, settings); } if (filter->data && filter->len) diff --git a/commit-graph.c b/commit-graph.c index cb042bdba8..afe14af2a3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f, uint32_t cur_pos = 0; while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL); size_t len = filter ? filter->len : 0; cur_pos += len; display_progress(ctx->progress, ++ctx->progress_cnt); @@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f, hashwrite_be32(f, ctx->bloom_settings->bits_per_entry); while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL); size_t len = filter ? filter->len : 0; display_progress(ctx->progress, ++ctx->progress_cnt); -- 2.29.0.rc1.dirty