[PATCH v3 00/10] More commit-graph/Bloom filter improvements

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> · Fri, 26 Jun 2020 12:30:26 +0000

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@xxxxxxxxx/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@xxxxxxxxx/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.

Updates in V3:

 * Responded to René's feedback.
 * Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the
   GIT_TRACE2 depth to work with 'seen' branch.

Thanks, -Stolee

Derrick Stolee (5):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: fix logic in get_bloom_filter()
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  14 ++-
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 138 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  58 +++++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  23 ++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 189 insertions(+), 65 deletions(-)

base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v2:

  1:  57002040bc =  1:  57002040bc commit-graph: place bloom_settings in context
  2:  6b63f9bd8a =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  3:  492deaf916 !  3:  2f809499ab bloom: get_bloom_filter() cleanups
     @@ Metadata
      Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

       ## Commit message ##
     -    bloom: get_bloom_filter() cleanups
     +    bloom: fix logic in get_bloom_filter()

          The get_bloom_filter() method is a bit complicated in some parts where
          it does not need to be. In particular, it needs to return a NULL filter
     @@ Commit message

          Also clean up some style issues while we are here.

     +    Helped-by: René Scharfe <l.s.r@xxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

       ## bloom.c ##
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
      -			else
      -				return NULL;
      -		}
     -+		    r->objects->commit_graph->chunk_bloom_indexes &&
     -+		    load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
     -+			return filter;
     ++		    r->objects->commit_graph->chunk_bloom_indexes)
     ++			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
       	}

      -	if (filter->data || !compute_if_not_present)
      +	if (filter->data)
       		return filter;
     -+	if (!filter->data && !compute_if_not_present)
     ++	if (!compute_if_not_present)
      +		return NULL;

       	repo_diff_setup(r, &diffopt);
  4:  8727b25468 !  4:  33e22d05cb commit-graph: persist existence of changed-paths
     @@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
       		progress = start_delayed_progress(
       			_("Writing changed paths Bloom filters data"),
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     + 	int num_chunks = 3;
     + 	uint64_t chunk_offset;
       	struct object_id file_hash;
     - 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     +-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     ++	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;

      -	ctx->bloom_settings = &bloom_settings;
     -+	if (!ctx->bloom_settings)
     ++	if (!ctx->bloom_settings) {
     ++		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
     ++							      bloom_settings.bits_per_entry);
     ++		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
     ++							  bloom_settings.num_hashes);
      +		ctx->bloom_settings = &bloom_settings;
     ++	}

       	if (ctx->split) {
       		struct strbuf tmp_file = STRBUF_INIT;
     @@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in th
       	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
       '

     -+BASE_BDAT_OFFSET=2240
     -+BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
     -+BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
     -+
     -+corrupt_graph() {
     -+	pos=$1
     -+	data="${2:-\0}"
     -+	grepstr=$3
     -+	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
     -+	zero_pos=${4:-${orig_size}} &&
     -+	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
     -+	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
     -+}
     -+
      +test_expect_success 'persist filter settings' '
      +	test_when_finished rm -rf .git/objects/info/commit-graph* &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     -+	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
     -+	cp .git/objects/info/commit-graph commit-graph-before &&
     -+	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
     -+	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
     -+	cp .git/objects/info/commit-graph commit-graph-after &&
     -+	test_commit c18 A/corrupt &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     -+	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
     ++	rm -rf .git/objects/info/commit-graph* &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
     ++		GIT_TRACE2_EVENT_NESTING=5 \
     ++		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
     ++		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
     ++		git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
     ++		GIT_TRACE2_EVENT_NESTING=5 \
     ++		git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
      +'
      +
       test_done
  5:  244668fec4 !  5:  81c45d5260 commit-graph: unify the signatures of all write_graph_chunk_*() functions
     @@ Commit message
          This opens up the possibility for further cleanups and foolproofing in
          the following two patches.

     +    Helped-by: René Scharfe <l.s.r@xxxxxx>
          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

     @@ commit-graph.c: static void write_graph_chunk_fanout(struct hashfile *f,
       	for (count = 0; count < ctx->commits.nr; count++, list++) {
       		display_progress(ctx->progress, ++ctx->progress_cnt);
      -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
     -+		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
     ++		hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
       	}
      +
      +	return 0;
  6:  8b959f2f37 !  6:  8828dcd906 commit-graph: simplify chunk writes into loop
     @@ Commit message
          file.  This will open up the possibility for a bit of foolproofing in
          the following patch.

     +    Helped-by: René Scharfe <l.s.r@xxxxxx>
          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -		return -1;
      +
      +	for (i = 0; i < num_chunks; i++) {
     -+		if (chunks[i].write_fn(f, ctx)) {
     -+			error(_("failed writing chunk with id %"PRIx32""),
     -+			      chunks[i].id);
     ++		if (chunks[i].write_fn(f, ctx))
      +			return -1;
     -+		}
       	}
      +
       	stop_progress(&ctx->progress);
  7:  3eb10933dc !  7:  ddbf297755 commit-graph: check chunk sizes after writing
     @@ Commit message
          in a loop we can do this check in a single place for all chunks, and
          any chunks added in the future will get checked as well.

     +    Helped-by: René Scharfe <l.s.r@xxxxxx>
          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

       ## commit-graph.c ##
      @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     - 			num_chunks * ctx->commits.nr);
       	}

     -+	chunk_offset = f->total + f->offset;
       	for (i = 0; i < num_chunks; i++) {
     -+		uint64_t end_offset;
     ++		uint64_t start_offset = f->total + f->offset;
      +
     - 		if (chunks[i].write_fn(f, ctx)) {
     - 			error(_("failed writing chunk with id %"PRIx32""),
     - 			      chunks[i].id);
     + 		if (chunks[i].write_fn(f, ctx))
       			return -1;
     - 		}
      +
     -+		end_offset = f->total + f->offset;
     -+		if (end_offset - chunk_offset != chunks[i].size)
     ++		if (f->total + f->offset != start_offset + chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
     -+			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
     -+		chunk_offset = end_offset;
     ++			    chunks[i].size, chunks[i].id,
     ++			    f->total + f->offset - start_offset);
       	}

       	stop_progress(&ctx->progress);
  8:  0bcfc1f051 =  8:  8b63706141 revision.c: fix whitespace
  9:  719c7091a7 =  9:  7d6163305a revision: empty pathspecs should not use Bloom filters
 10:  9c2076b4ce ! 10:  40061233ca commit-graph: check all leading directories in changed path Bloom filters
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
      +	p = path;
      +	while (*p) {
     -+		if (is_dir_sep(*p))
     ++		/*
     ++		 * At this point, the path is normalized to use Unix-style
     ++		 * path separators. This is required due to how the
     ++		 * changed-path Bloom filters store the paths.
     ++		 */
     ++		if (*p == '/')
      +			path_component_nr++;
      +		p++;
      +	}
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      +
      +	p = path + len - 1;
      +	while (p > path) {
     -+		if (is_dir_sep(*p))
     ++		if (*p == '/')
      +			fill_bloom_key(path, p - path,
      +				       &revs->bloom_keys[path_component_nr++],
      +				       revs->bloom_filter_settings);
 11:  1022c0ad21 <  -:  ---------- bloom: enforce a minimum size of 8 bytes

-- 
gitgitgadget