Thanks, Junio and Taylor, for your reviews. I have included Taylor's patches in this patch set. There seemed to be some merge conflicts when I tried to apply the patches Taylor provided on the base that I built my patches on (that is, the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased all my patches onto latest master. Jonathan Tan (4): gitformat-commit-graph: describe version 2 of BDAT t4216: test changed path filters with high bit paths repo-settings: introduce commitgraph.changedPathsVersion commit-graph: new filter ver. that fixes murmur3 Taylor Blau (3): t/helper/test-read-graph.c: extract `dump_graph_info()` bloom.h: make `load_bloom_filter_from_graph()` public t/helper/test-read-graph: implement `bloom-filters` mode Documentation/config/commitgraph.txt | 26 +++++- Documentation/gitformat-commit-graph.txt | 9 +- bloom.c | 75 ++++++++++++++-- bloom.h | 13 ++- commit-graph.c | 33 +++++-- oss-fuzz/fuzz-commit-graph.c | 2 +- repo-settings.c | 6 +- repository.h | 2 +- t/helper/test-bloom.c | 9 +- t/helper/test-read-graph.c | 67 ++++++++++++--- t/t0095-bloom.sh | 8 ++ t/t4216-log-bloom.sh | 104 +++++++++++++++++++++++ 12 files changed, 315 insertions(+), 39 deletions(-) Range-diff against v5: 1: efe7f40fed = 1: 3ce6090a4d gitformat-commit-graph: describe version 2 of BDAT -: ---------- > 2: 1955734d1f t/helper/test-read-graph.c: extract `dump_graph_info()` -: ---------- > 3: 4cf7c2f634 bloom.h: make `load_bloom_filter_from_graph()` public -: ---------- > 4: 47b55758e6 t/helper/test-read-graph: implement `bloom-filters` mode 2: f684d07971 ! 5: 5276e6a90e t4216: test changed path filters with high bit paths @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm ) ' -+get_bdat_offset () { -+ perl -0777 -ne \ -+ 'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \ -+ .git/objects/info/commit-graph -+} -+ +get_first_changed_path_filter () { -+ BDAT_OFFSET=$(get_bdat_offset) && -+ perl -0777 -ne \ -+ 'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \ -+ .git/objects/info/commit-graph ++ test-tool read-graph bloom-filters >filters.dat && ++ head -n 1 filters.dat +} + +# chosen to be the same under all Unicode normalization forms @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm + +test_expect_success 'setup check value of version 1 changed-path' ' + (cd highbit1 && -+ printf "52a9" >expect && ++ echo "52a9" >expect && + get_first_changed_path_filter >actual && + test_cmp expect actual) +' 3: 2fadd87063 ! 6: dc3f6d2d4f repo-settings: introduce commitgraph.changedPathsVersion @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters:: - If true, then git will use the changed-path Bloom filters in the - commit-graph file (if it exists, and they are present). Defaults to - true. See linkgit:git-commit-graph[1] for more information. -+ Deprecated. Equivalent to changedPathsVersion=-1 if true, and -+ changedPathsVersion=0 if false. ++ Deprecated. Equivalent to commitGraph.changedPathsVersion=-1 if true, and ++ commitGraph.changedPathsVersion=0 if false. (If commitGraph.changedPathVersion ++ is also set, commitGraph.changedPathsVersion takes precedence.) + +commitGraph.changedPathsVersion:: + Specifies the version of the changed-path Bloom filters that Git will read and -+ write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not -+ match the version set in this config variable will be ignored. ++ write. May be -1, 0 or 1. ++ +Defaults to -1. ++ +If -1, Git will use the version of the changed-path Bloom filters in the +repository, defaulting to 1 if there are none. ++ -+If 0, git will write version 1 Bloom filters when instructed to write. ++If 0, Git will not read any Bloom filters, and will write version 1 Bloom ++filters when instructed to write. +++ ++If 1, Git will only read version 1 Bloom filters, and will write version 1 ++Bloom filters. ++ +See linkgit:git-commit-graph[1] for more information. @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s, } - if (s->commit_graph_read_changed_paths) { -+ if (s->commit_graph_changed_paths_version != 0) { ++ if (s->commit_graph_changed_paths_version) { pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES, &graph->chunk_bloom_indexes); read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA, @@ repo-settings.c: void prepare_repo_settings(struct repository *r) int value; const char *strval; int manyfiles; -+ int readChangedPaths; ++ int read_changed_paths; if (!r->gitdir) BUG("Cannot add settings for uninitialized repository"); @@ repo-settings.c: void prepare_repo_settings(struct repository *r) repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1); repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2); - repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1); -+ repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1); ++ repo_cfg_bool(r, "commitgraph.readchangedpaths", &read_changed_paths, 1); + repo_cfg_int(r, "commitgraph.changedpathsversion", + &r->settings.commit_graph_changed_paths_version, -+ readChangedPaths ? -1 : 0); ++ read_changed_paths ? -1 : 0); repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1); repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0); @@ repository.h: struct repo_settings { - int commit_graph_read_changed_paths; + int commit_graph_changed_paths_version; int gc_write_commit_graph; - int gc_cruft_packs; int fetch_write_commit_graph; + int command_requires_full_index; 4: e31711ae85 ! 7: 6e2d797406 commit-graph: new filter ver. that fixes murmur3 @@ Commit message controllable by a compiler option, and the default signedness of char is platform-specific). When a string contains characters with the high bit set, this bug causes results that, although internally consistent within - Git, does not accord with other implementations of murmur3 and even with - Git binaries that were compiled with different signedness of char. This - bug affects both how Git writes changed path filters to disk and how Git - interprets changed path filters on disk. + Git, does not accord with other implementations of murmur3 (thus, + the changed path filters wouldn't be readable by other off-the-shelf + implementatios of murmur3) and even with Git binaries that were compiled + with different signedness of char. This bug affects both how Git writes + changed path filters to disk and how Git interprets changed path filters + on disk. Therefore, introduce a new version (2) of changed path filters that corrects this problem. The existing version (1) is still supported and @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths:: commitGraph.changedPathsVersion:: Specifies the version of the changed-path Bloom filters that Git will read and -- write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not -+ write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not - match the version set in this config variable will be ignored. +- write. May be -1, 0 or 1. ++ write. May be -1, 0, 1, or 2. + Defaults to -1. + + +@@ Documentation/config/commitgraph.txt: filters when instructed to write. + If 1, Git will only read version 1 Bloom filters, and will write version 1 + Bloom filters. + + ++If 2, Git will only read version 2 Bloom filters, and will write version 2 ++Bloom filters. +++ + See linkgit:git-commit-graph[1] for more information. ## bloom.c ## -@@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g, +@@ bloom.c: int load_bloom_filter_from_graph(struct commit_graph *g, * Not considered to be cryptographically secure. * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm */ @@ bloom.c: void fill_bloom_key(const char *data, const uint32_t seed1 = 0x7e646e2c; - const uint32_t hash0 = murmur3_seeded(seed0, data, len); - const uint32_t hash1 = murmur3_seeded(seed1, data, len); -+ const uint32_t hash0 = (settings->hash_version == 2 -+ ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len); -+ const uint32_t hash1 = (settings->hash_version == 2 -+ ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len); ++ uint32_t hash0, hash1; ++ if (settings->hash_version == 2) { ++ hash0 = murmur3_seeded_v2(seed0, data, len); ++ hash1 = murmur3_seeded_v2(seed1, data, len); ++ } else { ++ hash0 = murmur3_seeded_v1(seed0, data, len); ++ hash1 = murmur3_seeded_v1(seed1, data, len); ++ } key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); for (i = 0; i < settings->num_hashes; i++) ## bloom.h ## -@@ bloom.h: struct repository; +@@ bloom.h: struct commit_graph; struct bloom_filter_settings { /* * The version of the hashing technique being used. @@ bloom.h: struct repository; */ uint32_t hash_version; -@@ bloom.h: struct bloom_key { +@@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g, * Not considered to be cryptographically secure. * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm */ @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star return 0; } -+struct graph_read_bloom_data_data { ++struct graph_read_bloom_data_context { + struct commit_graph *g; + int *commit_graph_changed_paths_version; +}; @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star size_t chunk_size, void *data) { - struct commit_graph *g = data; -+ struct graph_read_bloom_data_data *d = data; -+ struct commit_graph *g = d->g; ++ struct graph_read_bloom_data_context *c = data; ++ struct commit_graph *g = c->g; uint32_t hash_version; g->chunk_bloom_data = chunk_start; hash_version = get_be32(chunk_start); - if (hash_version != 1) -+ if (*d->commit_graph_changed_paths_version == -1) { -+ *d->commit_graph_changed_paths_version = hash_version; -+ } else if (hash_version != *d->commit_graph_changed_paths_version) { - return 0; +- return 0; ++ if (*c->commit_graph_changed_paths_version == -1) { ++ *c->commit_graph_changed_paths_version = hash_version; ++ } else if (hash_version != *c->commit_graph_changed_paths_version) { ++ return 0; + } g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s, } - if (s->commit_graph_changed_paths_version != 0) { -+ struct graph_read_bloom_data_data data = { + if (s->commit_graph_changed_paths_version) { ++ struct graph_read_bloom_data_context context = { + .g = graph, + .commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version + }; @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s, &graph->chunk_bloom_indexes); read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA, - graph_read_bloom_data, graph); -+ graph_read_bloom_data, &data); ++ graph_read_bloom_data, &context); } if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) { @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004| ## t/t4216-log-bloom.sh ## -@@ t/t4216-log-bloom.sh: get_bdat_offset () { - .git/objects/info/commit-graph - } - -+get_changed_path_filter_version () { -+ BDAT_OFFSET=$(get_bdat_offset) && -+ perl -0777 -ne \ -+ 'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \ -+ .git/objects/info/commit-graph -+} -+ - get_first_changed_path_filter () { - BDAT_OFFSET=$(get_bdat_offset) && - perl -0777 -ne \ @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' ' git -C highbit1 commit-graph write --reachable --changed-paths ' @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, versi -test_expect_success 'setup check value of version 1 changed-path' ' +test_expect_success 'check value of version 1 changed-path' ' (cd highbit1 && - printf "52a9" >expect && + echo "52a9" >expect && get_first_changed_path_filter >actual && @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' ' test_bloom_filters_used "-- $CENT") @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers + git -C highbit1 commit-graph write --reachable --changed-paths && + (cd highbit1 && + git config --add commitgraph.changedPathsVersion -1 && -+ printf "00000001" >expect && -+ get_changed_path_filter_version >actual && ++ echo "options: bloom(1,10,7) read_generation_data" >expect && ++ test-tool read-graph >full && ++ grep options full >actual && + test_cmp expect actual) +' + @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers + +test_expect_success 'check value of version 2 changed-path' ' + (cd highbit2 && -+ printf "c01f" >expect && ++ echo "c01f" >expect && + get_first_changed_path_filter >actual && + test_cmp expect actual) +' @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers + git -C highbit2 commit-graph write --reachable --changed-paths && + (cd highbit2 && + git config --add commitgraph.changedPathsVersion -1 && -+ printf "00000002" >expect && -+ get_changed_path_filter_version >actual && ++ echo "options: bloom(2,10,7) read_generation_data" >expect && ++ test-tool read-graph >full && ++ grep options full >actual && + test_cmp expect actual) +' + -- 2.41.0.487.g6d72f3e995-goog