SZEDER Gábor <szeder.dev@xxxxxxxxx> writes: > Note that this is not the settings instance that eventually gets > written to the header of the Bloom filters chunk: > write_commit_graph_file() has its own 'struct bloom_filter_settings' > instance, and that's the one that goes into the chunk header. > ... > Unfortunately, the parameters used to compute the now reused Bloom > filters are not checked anywhere. In fact this writing process > entirely ignores all parameters in the header of the existing Bloom > filters chunk, and simply replaces them with the default parameters > hard-coded in write_commit_graph_file(). Consequently, we can end up > with Bloom filters computed with different parameters in the same > commit-graph file, which, in turn, can result in commits omitted from > the output of pathspec-limited revision walks. Yeah, the whole design seems quite broken and as you said later, mixing other ingredients like split file would only make things worse X-<. > The makeshift (there is no way to override those hard-coded defaults) > tests below demonstrate this issue. > > This issue raises a good couple of questions: > > - What should we do when updating a commit-graph that was written > with different Bloom filter parameters than our hardcoded > defaults? > > Reusing the exising Bloom filters is clearly wrong. Throwing away > all existing Bloom filters and recomputing them with our defaults > parameters doesn't seem to be good option, because that's a > considerable amount of work, and the user might have a reason to > chose those parameters. > > - What should we do when updating a commit-graph that was written > with different Bloom filter parameters than specified by the user > on the command line or in the config? > > Wipe out the old Bloom filters and recompute with new parameters, > spending considerable time in bigger repositories? Or stop with a > warning about the different parameters (maybe it's just a typo), > and require '--force'? > > Dunno, and we don't have such options and configuration yet > anyway. > > - What about split commit-graphs? > > When split commit-graphs were introduced there was not a single > chunk that had its own header. Now the Bloom filters chunk does > have a header, which leads to other questions: > > - Should that Bloom filters header be included in every split > commit-graph? > > Not sure, but I suppose that having a header in each split > commit-graph file would make loading and parsing that chunk a > bit simpler, because all of them should be parsed the same way. > Anyway, I think the specs should be explicit about it. But...: > > - Should we allow different parameters in the Bloom filter chunks > in each split commit-graph? > > The point of split commit-graphs is to avoid the overhead of > re-writing the whole commit-graph file every time new commits > are added, and it's crucial that both writing and merging split > commit-graph files are cheap. However, split commit-graph files > using different Bloom filter parameters can't be merged without > recomputing those Bloom filters, making merging quite expensive. > > So I don't think that it's a good idea to allow different Bloom > filter parameters in split commit-graphs. But then perhaps it > would be better not to have a Bloom filter chunk header in all > split commit-graph files after all. > > In any case, the last test below shows that the Bloom filter > parameters are only read from the header of the most recent split > commit-graph file. > > > --- >8 --- > > #!/bin/sh > > test_description='test' > > . ./test-lib.sh > > test_expect_success 'yuckiest setup ever!' ' > ( > cd "$GIT_BUILD_DIR" && > > # The number of hashes per path cannot be configured > # at runtime, so build a dedicated git binary that > # writes Bloom filters using only 6 hashes per path. > sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h && > make -j4 git && > cp git git6 && > > # Revert, rebuild. > sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h && > make -j4 git > ) && > git6="$GIT_BUILD_DIR"/git6 > ' > > test_expect_success 'setup' ' > # We need a filename whose 7th hash maps to a different bit > # position than any of its first 6 hashes in a 2-byte Bloom > # filter. > file=File && > > test_tick && > git commit --allow-empty -m initial && > echo 1 >$file && > git add $file && > git commit -m one $file && > echo 2 >$file && > git commit -m two $file && > > git log --oneline -- $file >expect > ' > > test_expect_success 'can read Bloom filters with different parameters' ' > test_when_finished "rm -rfv .git/objects/info/commit-graph*" && > > # Write a commit-graph with Bloom filters using only 6 hashes > # per path. > "$git6" commit-graph write --reachable --changed-paths && > > # Try pathspec-limited revision walk with the git binary writing > # Bloom filters using 7 hashes: it still works, because no matter > # how many hashes it would use when writing the commit-graph, the > # reader part respects the nr of hashes stored in the > # commit-graph file. So far so good. > git log --oneline $file >actual && > test_cmp expect actual > ' > > test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' ' > test_when_finished "rm -rfv .git/objects/info/commit-graph*" && > > # Write a commit-graph with Bloom filters using only 6 hashes > # per path for a subset of commits. > git rev-parse HEAD^ | > "$git6" commit-graph write --stdin-commits --changed-paths && > > # Add the rest of the commits to the commit-graph containing Bloom > # filters using 6 hashes with a git version that writes Bloom > # filters using 7 hashes. > # Does it reuse the existing Bloom filters with 6 hashes? > git commit-graph write --reachable --changed-paths && > > # Yes, it does, because these report different filter data, > # even though both commits modified the same file. > test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) && > test-tool bloom get_filter_for_commit $(git rev-parse HEAD) && > > # Furthermore, it updated the Bloom filter chunk header as well, > # which now stores that all Bloom filters use 7 hashes. > # Consequently, the first commit whose Bloom filter was written > # with only 6 hashes falls victim of a false negative, and is > # omitted from the output. > git log --oneline $file >actual && > test_cmp expect actual > ' > > test_expect_failure 'split commit-graphs and Bloom filters with different parameters' ' > test_when_finished "rm -rfv .git/objects/info/commit-graph*" && > > git rev-parse HEAD^ | > "$git6" commit-graph write --stdin-commits --changed-paths --split && > > git commit-graph write --reachable --changed-paths --split=no-merge && > > # To make sure that I test what I want, i.e. two commit-graphs > # with one commit in each. (Though "test-tool read-graph" is > # utterly oblivious to split commit graphs...) > test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain && > verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 && > > test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) && > test-tool bloom get_filter_for_commit $(git rev-parse HEAD) && > > git log --oneline $file >actual && > test_cmp expect actual > ' > > test_done > > --- 8< --- > >> + if (filter->data || !compute_if_not_present) >> + return filter; >> + >> repo_diff_setup(r, &diffopt); >> diffopt.flags.recursive = 1; >> diffopt.max_changes = max_changes; >> diff --git a/bloom.h b/bloom.h >> index 85ab8e9423d..760d7122374 100644 >> --- a/bloom.h >> +++ b/bloom.h >> @@ -32,6 +32,7 @@ struct bloom_filter_settings { >> >> #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 } >> #define BITS_PER_WORD 8 >> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t) >> >> /* >> * A bloom_filter struct represents a data segment to >> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key, >> void init_bloom_filters(void); >> >> struct bloom_filter *get_bloom_filter(struct repository *r, >> - struct commit *c); >> + struct commit *c, >> + int compute_if_not_present); >> >> #endif >> \ No newline at end of file >> diff --git a/commit-graph.c b/commit-graph.c >> index a8b6b5cca5d..77668629e27 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, >> ctx->commits.nr); >> >> while (list < last) { >> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); >> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); >> cur_pos += filter->len; >> display_progress(progress, ++i); >> hashwrite_be32(f, cur_pos); >> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, >> hashwrite_be32(f, settings->bits_per_entry); >> >> while (list < last) { >> - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); >> + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); >> display_progress(progress, ++i); >> hashwrite(f, filter->data, filter->len * sizeof(unsigned char)); >> list++; >> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) >> >> for (i = 0; i < ctx->commits.nr; i++) { >> struct commit *c = sorted_commits[i]; >> - struct bloom_filter *filter = get_bloom_filter(ctx->r, c); >> + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1); >> ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len; >> display_progress(progress, i + 1); >> } >> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c >> index f18d1b722e1..ce412664ba9 100644 >> --- a/t/helper/test-bloom.c >> +++ b/t/helper/test-bloom.c >> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid) >> struct bloom_filter *filter; >> setup_git_directory(); >> c = lookup_commit(the_repository, commit_oid); >> - filter = get_bloom_filter(the_repository, c); >> + filter = get_bloom_filter(the_repository, c, 1); >> print_bloom_filter(filter); >> } >> >> -- >> gitgitgadget