[PATCH v5 00/17] bloom: changed-path Bloom filters v2 (& sundries)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



(Rebased onto the tip of 'master', which is d4dbce1db5 (The seventh
batch, 2024-01-12), at the time of writing).

This series is a reroll of the combined efforts of [1] and [2] to
introduce the v2 changed-path Bloom filters, which fixes a bug in our
existing implementation of murmur3 paths with non-ASCII characters (when
the "char" type is signed).

In large part, this is the same as the previous round. Like last time,
this round addresses the remaining additional issues pointed out by
SZEDER Gábor. The remaining issues which have been addressed by this
series are:

  - Incorrectly reading Bloom filters computed with differing hash
    versions. This has been corrected by discarding them when a version
    mismatch is detected.

  - Added a note about the new `commitGraph.changedPathVersion`
    configuration variable which can cause (un-fixable, see [3])
    issues in earlier versions of Git which do not yet understand them.

Thanks to Jonathan, Peff, and SZEDER who have helped a great deal in
assembling these patches. As usual, a range-diff is included below.

Thanks in advance for your review!

[1]: https://lore.kernel.org/git/cover.1684790529.git.jonathantanmy@xxxxxxxxxx/
[2]: https://lore.kernel.org/git/cover.1691426160.git.me@xxxxxxxxxxxx/
[3]: https://lore.kernel.org/git/Zabr1Glljjgl%2FUUB@nand.local/

Jonathan Tan (1):
  gitformat-commit-graph: describe version 2 of BDAT

Taylor Blau (16):
  t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  revision.c: consult Bloom filters for root commits
  commit-graph: ensure Bloom filters are read with consistent settings
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new Bloom filter version that fixes murmur3
  bloom: annotate filters with hash version
  bloom: prepare to discard incompatible Bloom filters
  commit-graph.c: unconditionally load Bloom filters
  commit-graph: drop unnecessary `graph_read_bloom_data_context`
  object.h: fix mis-aligned flag bits table
  commit-graph: reuse existing Bloom filters where possible
  bloom: introduce `deinit_bloom_filters()`

 Documentation/config/commitgraph.txt     |  29 ++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  | 208 ++++++++++++++-
 bloom.h                                  |  38 ++-
 commit-graph.c                           |  64 ++++-
 object.h                                 |   3 +-
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 revision.c                               |  26 +-
 t/helper/test-bloom.c                    |   9 +-
 t/helper/test-read-graph.c               |  67 ++++-
 t/t0095-bloom.sh                         |   8 +
 t/t4216-log-bloom.sh                     | 310 ++++++++++++++++++++++-
 14 files changed, 724 insertions(+), 57 deletions(-)

Range-diff against v4:
 1:  e0fc51c3fb !  1:  c5e1b3e507 t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
    @@ Commit message
         checking for the absence of any data from it from trace2.
     
         In the following commit, it will become possible to load Bloom filters
    -    without using them (e.g., because `commitGraph.changedPathVersion` is
    -    incompatible with the hash version with which the commit-graph's Bloom
    -    filters were written).
    +    without using them (e.g., because the `commitGraph.changedPathVersion`
    +    introduced later in this series is incompatible with the hash version
    +    with which the commit-graph's Bloom filters were written).
     
         When this is the case, it's possible to initialize the Bloom filter
         sub-system, while still not using any Bloom filters. When this is the
 2:  87b09e6266 !  2:  8f32fd5f46 revision.c: consult Bloom filters for root commits
    @@ revision.c: static int rev_compare_tree(struct rev_info *revs,
     +				  int nth_parent)
      {
      	struct tree *t1 = repo_get_commit_tree(the_repository, commit);
    -+	int bloom_ret = 1;
    ++	int bloom_ret = -1;
      
      	if (!t1)
      		return 0;
      
    -+	if (nth_parent == 1 && revs->bloom_keys_nr) {
    ++	if (!nth_parent && revs->bloom_keys_nr) {
     +		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
     +		if (!bloom_ret)
     +			return 1;
    @@ revision.c: static void try_to_simplify_commit(struct rev_info *revs, struct com
     +		 * (if one exists) is relative to the empty tree, using Bloom
     +		 * filters is allowed here.
     +		 */
    -+		if (rev_same_tree_as_empty(revs, commit, 1))
    ++		if (rev_same_tree_as_empty(revs, commit, 0))
      			commit->object.flags |= TREESAME;
      		return;
      	}
 3:  46d8a41005 !  3:  285b25f1b7 commit-graph: ensure Bloom filters are read with consistent settings
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +	test_must_be_empty err
     +'
     +
    - test_done
    + corrupt_graph () {
    +-	graph=.git/objects/info/commit-graph &&
    + 	test_when_finished "rm -rf $graph" &&
    + 	git commit-graph write --reachable --changed-paths &&
    + 	corrupt_chunk_file $graph "$@"
 4:  4d0190a992 =  4:  0cee8078d4 gitformat-commit-graph: describe version 2 of BDAT
 5:  3c2057c11c =  5:  1fc8d2828d t/helper/test-read-graph.c: extract `dump_graph_info()`
 6:  e002e35004 !  6:  03dd7cf30a bloom.h: make `load_bloom_filter_from_graph()` public
    @@ Commit message
         Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
     
      ## bloom.c ##
    -@@ bloom.c: static inline unsigned char get_bitmask(uint32_t pos)
    - 	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
    +@@ bloom.c: static int check_bloom_offset(struct commit_graph *g, uint32_t pos,
    + 	return -1;
      }
      
     -static int load_bloom_filter_from_graph(struct commit_graph *g,
 7:  c7016f51cd =  7:  dd9193e404 t/helper/test-read-graph: implement `bloom-filters` mode
 8:  cef2aac8ba !  8:  aa2416795d t4216: test changed path filters with high bit paths
    @@
      ## Metadata ##
    -Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
    +Author: Taylor Blau <me@xxxxxxxxxxxx>
     
      ## Commit message ##
         t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'merge graph layers with incompatible
     +	)
     +'
     +
    - test_done
    + corrupt_graph () {
    + 	test_when_finished "rm -rf $graph" &&
    + 	git commit-graph write --reachable --changed-paths &&
 9:  36d4e2202e !  9:  a77dcc99b4 repo-settings: introduce commitgraph.changedPathsVersion
    @@
      ## Metadata ##
    -Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
    +Author: Taylor Blau <me@xxxxxxxxxxxx>
     
      ## Commit message ##
         repo-settings: introduce commitgraph.changedPathsVersion
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be -1, 0 or 1.
    ++	write. May be -1, 0 or 1. Note that values greater than 1 may be
    ++	incompatible with older versions of Git which do not yet understand
    ++	those versions. Use caution when operating in a mixed-version
    ++	environment.
     ++
     +Defaults to -1.
     ++
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      
     -	if (s->commit_graph_read_changed_paths) {
     +	if (s->commit_graph_changed_paths_version) {
    - 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    - 			   &graph->chunk_bloom_indexes);
    + 		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    + 			   graph_read_bloom_index, graph);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    +@@ commit-graph.c: static void validate_mixed_bloom_settings(struct commit_graph *g)
    + 		}
    + 
    + 		if (g->bloom_filter_settings->bits_per_entry != settings->bits_per_entry ||
    +-		    g->bloom_filter_settings->num_hashes != settings->num_hashes) {
    ++		    g->bloom_filter_settings->num_hashes != settings->num_hashes ||
    ++		    g->bloom_filter_settings->hash_version != settings->hash_version) {
    + 			g->chunk_bloom_indexes = NULL;
    + 			g->chunk_bloom_data = NULL;
    + 			FREE_AND_NULL(g->bloom_filter_settings);
     
      ## oss-fuzz/fuzz-commit-graph.c ##
     @@ oss-fuzz/fuzz-commit-graph.c: int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    @@ repository.h: struct repo_settings {
      	int gc_write_commit_graph;
      	int fetch_write_commit_graph;
      	int command_requires_full_index;
    +
    + ## t/t4216-log-bloom.sh ##
    +@@ t/t4216-log-bloom.sh: test_expect_success 'setup for mixed Bloom setting tests' '
    + 	done
    + '
    + 
    +-test_expect_success 'ensure incompatible Bloom filters are ignored' '
    ++test_expect_success 'ensure Bloom filters with incompatible settings are ignored' '
    + 	# Compute Bloom filters with "unusual" settings.
    + 	git -C $repo rev-parse one >in &&
    + 	GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git -C $repo commit-graph write \
    +@@ t/t4216-log-bloom.sh: test_expect_success 'merge graph layers with incompatible Bloom settings' '
    + 	test_must_be_empty err
    + '
    + 
    ++test_expect_success 'ensure Bloom filter with incompatible versions are ignored' '
    ++	rm "$repo/$graph" &&
    ++
    ++	git -C $repo log --oneline --no-decorate -- $CENT >expect &&
    ++
    ++	# Compute v1 Bloom filters for commits at the bottom.
    ++	git -C $repo rev-parse HEAD^ >in &&
    ++	git -C $repo commit-graph write --stdin-commits --changed-paths \
    ++		--split <in &&
    ++
    ++	# Compute v2 Bloomfilters for the rest of the commits at the top.
    ++	git -C $repo rev-parse HEAD >in &&
    ++	git -C $repo -c commitGraph.changedPathsVersion=2 commit-graph write \
    ++		--stdin-commits --changed-paths --split=no-merge <in &&
    ++
    ++	test_line_count = 2 $repo/$chain &&
    ++
    ++	git -C $repo log --oneline --no-decorate -- $CENT >actual 2>err &&
    ++	test_cmp expect actual &&
    ++
    ++	layer="$(head -n 1 $repo/$chain)" &&
    ++	cat >expect.err <<-EOF &&
    ++	warning: disabling Bloom filters for commit-graph layer $SQ$layer$SQ due to incompatible settings
    ++	EOF
    ++	test_cmp expect.err err
    ++'
    ++
    + get_first_changed_path_filter () {
    + 	test-tool read-graph bloom-filters >filters.dat &&
    + 	head -n 1 filters.dat
10:  f6ab427ead ! 10:  f0f22e852c commit-graph: new filter ver. that fixes murmur3
    @@
      ## Metadata ##
    -Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
    +Author: Taylor Blau <me@xxxxxxxxxxxx>
     
      ## Commit message ##
    -    commit-graph: new filter ver. that fixes murmur3
    +    commit-graph: new Bloom filter version that fixes murmur3
     
         The murmur3 implementation in bloom.c has a bug when converting series
         of 4 bytes into network-order integers when char is signed (which is
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be -1, 0 or 1.
    -+	write. May be -1, 0, 1, or 2.
    - +
    - Defaults to -1.
    - +
    +-	write. May be -1, 0 or 1. Note that values greater than 1 may be
    ++	write. May be -1, 0, 1, or 2. Note that values greater than 1 may be
    + 	incompatible with older versions of Git which do not yet understand
    + 	those versions. Use caution when operating in a mixed-version
    + 	environment.
     @@ Documentation/config/commitgraph.txt: filters when instructed to write.
      If 1, Git will only read version 1 Bloom filters, and will write version 1
      Bloom filters.
    @@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
      		    size_t len,
     
      ## commit-graph.c ##
    -@@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_start,
    +@@ commit-graph.c: static int graph_read_bloom_index(const unsigned char *chunk_start,
      	return 0;
      }
      
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     +	struct graph_read_bloom_data_context *c = data;
     +	struct commit_graph *g = c->g;
      	uint32_t hash_version;
    --	g->chunk_bloom_data = chunk_start;
    - 	hash_version = get_be32(chunk_start);
      
    --	if (hash_version != 1)
    -+	if (*c->commit_graph_changed_paths_version == -1) {
    + 	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
    +@@ commit-graph.c: static int graph_read_bloom_data(const unsigned char *chunk_start,
    + 		return -1;
    + 	}
    + 
    ++	hash_version = get_be32(chunk_start);
    ++
    ++	if (*c->commit_graph_changed_paths_version == -1)
     +		*c->commit_graph_changed_paths_version = hash_version;
    -+	} else if (hash_version != *c->commit_graph_changed_paths_version) {
    - 		return 0;
    -+	}
    - 
    -+	g->chunk_bloom_data = chunk_start;
    ++	else if (hash_version != *c->commit_graph_changed_paths_version)
    ++		return 0;
    ++
    + 	g->chunk_bloom_data = chunk_start;
    + 	g->chunk_bloom_data_size = chunk_size;
    +-	hash_version = get_be32(chunk_start);
    +-
    +-	if (hash_version != 1)
    +-		return 0;
    +-
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
      	g->bloom_filter_settings->hash_version = hash_version;
      	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
     +			.g = graph,
     +			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
    - 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    - 			   &graph->chunk_bloom_indexes);
    + 		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    + 			   graph_read_bloom_index, graph);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -			   graph_read_bloom_data, graph);
     +			   graph_read_bloom_data, &context);
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +	)
     +'
     +
    - test_done
    + corrupt_graph () {
    + 	test_when_finished "rm -rf $graph" &&
    + 	git commit-graph write --reachable --changed-paths &&
11:  dc69b28329 = 11:  b56e94cad7 bloom: annotate filters with hash version
12:  85dbdc4ed2 = 12:  ddfd1ba32a bloom: prepare to discard incompatible Bloom filters
13:  3ff669a622 ! 13:  72aabd289b commit-graph.c: unconditionally load Bloom filters
    @@ Metadata
      ## Commit message ##
         commit-graph.c: unconditionally load Bloom filters
     
    -    In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3,
    -    2023-08-01), we began ignoring the Bloom data ("BDAT") chunk for
    -    commit-graphs whose Bloom filters were computed using a hash version
    -    incompatible with the value of `commitGraph.changedPathVersion`.
    +    In an earlier commit, we began ignoring the Bloom data ("BDAT") chunk
    +    for commit-graphs whose Bloom filters were computed using a hash version
    +      incompatible with the value of `commitGraph.changedPathVersion`.
     
         Now that the Bloom API has been hardened to discard these incompatible
         filters (with the exception of low-level APIs), we can safely load these
    @@ Commit message
     
      ## commit-graph.c ##
     @@ commit-graph.c: static int graph_read_bloom_data(const unsigned char *chunk_start,
    - 	uint32_t hash_version;
    + 
      	hash_version = get_be32(chunk_start);
      
    --	if (*c->commit_graph_changed_paths_version == -1) {
    +-	if (*c->commit_graph_changed_paths_version == -1)
     -		*c->commit_graph_changed_paths_version = hash_version;
    --	} else if (hash_version != *c->commit_graph_changed_paths_version) {
    +-	else if (hash_version != *c->commit_graph_changed_paths_version)
     -		return 0;
    --	}
     -
      	g->chunk_bloom_data = chunk_start;
    + 	g->chunk_bloom_data_size = chunk_size;
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    - 	g->bloom_filter_settings->hash_version = hash_version;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      	ctx->write_generation_data = (get_configured_generation_version(r) == 2);
      	ctx->num_generation_data_overflows = 0;
14:  1c78e3d178 ! 14:  526beb9766 commit-graph: drop unnecessary `graph_read_bloom_data_context`
    @@ Commit message
         Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
     
      ## commit-graph.c ##
    -@@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_start,
    +@@ commit-graph.c: static int graph_read_bloom_index(const unsigned char *chunk_start,
      	return 0;
      }
      
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     -	struct commit_graph *g = c->g;
     +	struct commit_graph *g = data;
      	uint32_t hash_version;
    - 	hash_version = get_be32(chunk_start);
      
    + 	if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
     -			.g = graph,
     -			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     -		};
    - 		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    - 			   &graph->chunk_bloom_indexes);
    + 		read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
    + 			   graph_read_bloom_index, graph);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -			   graph_read_bloom_data, &context);
     +			   graph_read_bloom_data, graph);
15:  a289514faa = 15:  c683697efa object.h: fix mis-aligned flag bits table
16:  6a12e39e7f ! 16:  4bf043be9a commit-graph: reuse existing Bloom filters where possible
    @@ Metadata
      ## Commit message ##
         commit-graph: reuse existing Bloom filters where possible
     
    -    In 9e4df4da07 (commit-graph: new filter ver. that fixes murmur3,
    -    2023-08-01), a bug was described where it's possible for Git to produce
    -    non-murmur3 hashes when the platform's "char" type is signed, and there
    -    are paths with characters whose highest bit is set (i.e. all characters
    -    >= 0x80).
    +    In an earlier commit, a bug was described where it's possible for Git to
    +    produce non-murmur3 hashes when the platform's "char" type is signed,
    +    and there are paths with characters whose highest bit is set (i.e. all
    +    characters >= 0x80).
     
         That patch allows the caller to control which version of Bloom filters
         are read and written. However, even on platforms with a signed "char"
    @@ t/t4216-log-bloom.sh: test_expect_success 'when writing commit graph, do not reu
     +	test_filter_upgraded 1 trace2.txt
     +'
     +
    - test_done
    + corrupt_graph () {
    + 	test_when_finished "rm -rf $graph" &&
    + 	git commit-graph write --reachable --changed-paths &&
17:  8942f205c8 = 17:  7daa0d8833 bloom: introduce `deinit_bloom_filters()`

base-commit: d4dbce1db5cd227a57074bcfc7ec9f0655961bba
-- 
2.43.0.334.gd4dbce1db5.dirty




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux