Sorry it took me a while to get back to this.

Looking at the existing code, Bloom filters are passed around a lot
without context, especially when writing: they are generated into a
commit slab and then, when it is time to write them to disk, they are
taken from that commit slab. Rather than annotating every place where
they are passed around, I thought it better to stick to the
single-version approach of version 4 (only one version per Git
invocation and per repo). This also sidesteps two questions: what
happens if there happen to be multiple commit graphs, each with its own
Bloom filter version (Git cannot generate this, but a hex editor can),
and what happens if we want to write a different version than what is
currently stored in the commit slab.

But with auto-detection of that version, I think we have what we need:
in regular operation, Git will run with whatever version is on disk,
and when it is time to migrate, the user can explicitly specify the
version.

I did not implement the mitigation of not using the Bloom filters when
a high-bit path is sought because, as Stolee says, this is useful only
when mixing Git implementations and will slow down operations (without
any increase in correctness) in the absence of such a mix [1]. But I can
implement this if need be.

[1] https://lore.kernel.org/git/e57b2272-b269-b705-3d42-d32e0b410f03@xxxxxxxxxx/

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     |  19 +++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  65 ++++++++++++-
 bloom.h                                  |   8 +-
 commit-graph.c                           |  33 +++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 117 +++++++++++++++++++++++
 11 files changed, 256 insertions(+), 22 deletions(-)

Range-diff against v4:
1:  a5955cda3d ! 1:  52e281eef0 gitformat-commit-graph: describe version 2 of BDAT
    @@ Documentation/gitformat-commit-graph.txt: All multi-byte numbers are in network
          hashing technique using seed values 0x293ae76f and 0x7e646e2 as described
          in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
     -    in Probabilistic Verification"
    -+    in Probabilistic Verification". Version 1 bloom filters have a bug that appears
    ++    in Probabilistic Verification". Version 1 Bloom filters have a bug that appears
     +    when char is signed and the repository has path names that have characters >=
     +    0x80; Git supports reading and writing them, but this ability will be removed
     +    in a future version of Git.
2:  68732120f9 ! 2:  94a4c7af38 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
    -+		get_first_changed_path_filter >actual)
    ++		get_first_changed_path_filter >actual &&
    ++		test_cmp expect actual)
     +'
     +
     +# expect will not match actual if char is unsigned by default. Write the test
3:  44cbcc6a69 ! 3:  131095666d repo-settings: introduce commitgraph.changedPathsVersion
    @@ Commit message
         repo-settings: introduce commitgraph.changedPathsVersion
     
         A subsequent commit will introduce another version of the changed-path
    -    filter in the commit graph file. In order to control which version is
    -    to be accepted when read (and which version to write), a config variable
    -    is needed.
    +    filter in the commit graph file. In order to control which version to
    +    write (and read), a config variable is needed.
     
         Therefore, introduce this config variable. For forwards compatibility,
         teach Git to not read commit graphs when the config variable
    @@ Commit message
         This commit does not change the behavior of writing (Git writes changed
         path filters when explicitly instructed regardless of any config
         variable), but a subsequent commit will restrict Git such that it will
    -    only write when commitgraph.changedPathsVersion is 0, 1, or 2.
    +    only write when commitgraph.changedPathsVersion is a recognized value.
     
         Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
         Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     -	If true, then git will use the changed-path Bloom filters in the
     -	commit-graph file (if it exists, and they are present). Defaults to
     -	true. See linkgit:git-commit-graph[1] for more information.
    -+	Deprecated. Equivalent to changedPathsVersion=1 if true, and
    ++	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
     +	changedPathsVersion=0 if false.
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
     +	match the version set in this config variable will be ignored.
     ++
    -+Defaults to 1.
    ++Defaults to -1.
    +++
    ++If -1, Git will use the version of the changed-path Bloom filters in the
    ++repository, defaulting to 1 if there are none.
     ++
     +If 0, git will write version 1 Bloom filters when instructed to write.
     ++
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
     -	if (s->commit_graph_read_changed_paths) {
    -+	if (s->commit_graph_changed_paths_version == 1) {
    ++	if (s->commit_graph_changed_paths_version != 0) {
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
     +	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
     +	repo_cfg_int(r, "commitgraph.changedpathsversion",
     +		     &r->settings.commit_graph_changed_paths_version,
    -+		     readChangedPaths ? 1 : 0);
    ++		     readChangedPaths ? -1 : 0);
      	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
      	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
      
4:  6dee3bfa70 ! 4:  47ba89c565 commit-graph: new filter ver. that fixes murmur3
    @@ Commit message
         So this patch does not include any mechanism to "salvage" changed path
         filters from repositories. There is also no "mixed" mode - for each
         invocation of Git, reading and writing changed path filters are done
    -    with the same version number.
    +    with the same version number; this version number may be explicitly
    +    stated (typically if the user knows which version they need) or
    +    automatically determined from the version of the existing changed path
    +    filters in the repository.
     
         There is a change in write_commit_graph(). graph_read_bloom_data()
         makes it possible for chunk_bloom_data to be non-NULL but
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
    +-	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    ++	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
      	match the version set in this config variable will be ignored.
      +
    - Defaults to 1.
    + Defaults to -1.
     
      ## bloom.c ##
     @@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     +struct graph_read_bloom_data_data {
     +	struct commit_graph *g;
    -+	int commit_graph_changed_paths_version;
    ++	int *commit_graph_changed_paths_version;
     +};
     +
      static int graph_read_bloom_data(const unsigned char *chunk_start,
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    -+	if (hash_version != d->commit_graph_changed_paths_version)
    - 		return 0;
    +-		return 0;
    ++	if (*d->commit_graph_changed_paths_version == -1) {
    ++		*d->commit_graph_changed_paths_version = hash_version;
    ++	} else if (hash_version != *d->commit_graph_changed_paths_version) {
    ++		return 0;
    ++	}
      
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
     +	g->bloom_filter_settings->hash_version = hash_version;
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
    - 		graph->read_generation_data = 1;
      	}
      
    --	if (s->commit_graph_changed_paths_version == 1) {
    -+	if (s->commit_graph_changed_paths_version == 1
    -+	    || s->commit_graph_changed_paths_version == 2) {
    + 	if (s->commit_graph_changed_paths_version != 0) {
     +		struct graph_read_bloom_data_data data = {
     +			.g = graph,
    -+			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
    ++			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
      	ctx->num_generation_data_overflows = 0;
      
    -+	if (r->settings.commit_graph_changed_paths_version < 0
    ++	if (r->settings.commit_graph_changed_paths_version < -1
     +	    || r->settings.commit_graph_changed_paths_version > 2) {
     +		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
     +			r->settings.commit_graph_changed_paths_version);
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
      
      ## t/t4216-log-bloom.sh ##
    +@@ t/t4216-log-bloom.sh: get_bdat_offset () {
    + 		.git/objects/info/commit-graph
    + }
    + 
    ++get_changed_path_filter_version () {
    ++	BDAT_OFFSET=$(get_bdat_offset) &&
    ++	perl -0777 -ne \
    ++		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
    ++		.git/objects/info/commit-graph
    ++}
    ++
    + get_first_changed_path_filter () {
    + 	BDAT_OFFSET=$(get_bdat_offset) &&
    + 	perl -0777 -ne \
    +@@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    + 	git -C highbit1 commit-graph write --reachable --changed-paths
    + '
    + 
    +-test_expect_success 'setup check value of version 1 changed-path' '
    ++test_expect_success 'check value of version 1 changed-path' '
    + 	(cd highbit1 &&
    + 		printf "52a9" >expect &&
    + 		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    ++test_expect_success 'version 1 changed-path used when autodetect requested' '
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' '
    ++	test_commit -C highbit1 c1double "$CENT$CENT" &&
    ++	git -C highbit1 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit1 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000001" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
    ++
     +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
    ++
    ++test_expect_success 'version 2 changed-path used when autodetect requested' '
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		test_bloom_filters_used "-- $CENT")
    ++'
    ++
    ++test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' '
    ++	test_commit -C highbit2 c2double "$CENT$CENT" &&
    ++	git -C highbit2 commit-graph write --reachable --changed-paths &&
    ++	(cd highbit2 &&
    ++		git config --add commitgraph.changedPathsVersion -1 &&
    ++		printf "00000002" >expect &&
    ++		get_changed_path_filter_version >actual &&
    ++		test_cmp expect actual)
    ++'
     +
      test_done
-- 
2.41.0.255.g8b1d071c50-goog
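
For reference, the version handling described in the cover letter can be
sketched in isolation. This is only an illustration under stated
assumptions, not code from the series: accept_filter_version() and
version_to_write() are hypothetical helpers, the enum names are made up,
and the write-side fallback to version 1 is taken from the documentation
text in patch 3 rather than from commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for two values of commitgraph.changedPathsVersion. */
enum {
	CPV_AUTODETECT = -1, /* use whatever version is on disk */
	CPV_DISABLED = 0     /* do not read changed-path filters */
};

/*
 * Decide whether a BDAT chunk whose header says `on_disk` may be used.
 * `*configured` is the in-memory setting for this Git invocation; when it
 * is -1 it latches onto the on-disk version, as in the range-diff hunk for
 * graph_read_bloom_data(). Returns 1 if the filters are used, 0 if ignored.
 */
static int accept_filter_version(int *configured, int on_disk)
{
	assert(configured != NULL);

	if (*configured == CPV_DISABLED)
		return 0;
	if (*configured == CPV_AUTODETECT) {
		*configured = on_disk;
		return 1;
	}
	return on_disk == *configured;
}

/*
 * Pick the version to write. Assumption from the documentation text:
 * -1 with nothing detected, and 0, both fall back to version 1.
 */
static int version_to_write(int configured)
{
	if (configured == CPV_AUTODETECT || configured == CPV_DISABLED)
		return 1;
	return configured;
}
```

Starting from the default of -1, reading a repository whose BDAT chunk
says version 2 leaves the setting at 2, so version_to_write() later in the
same invocation also yields 2. That is the "preserve existing version"
behavior the new t4216 tests check.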