[PATCH v2 00/11] More commit-graph/Bloom filter improvements

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@xxxxxxxxx/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@xxxxxxxxx/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.

Thanks, -Stolee

Derrick Stolee (6):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: get_bloom_filter() cleanups
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace
  bloom: enforce a minimum size of 8 bytes

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  19 ++--
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 136 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  53 ++++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  35 +++++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 200 insertions(+), 64 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v1:

  1:  c966969071 !  1:  57002040bc commit-graph: place bloom_settings in context
     @@ Commit message
          to combine the function prototypes and use function pointers to
          simplify write_commit_graph_file().
      
     +    By using a pointer, we can later replace the settings to match those
     +    that exist in the current commit-graph, in case a future Git version
     +    allows customization of these parameters.
     +
          Reported-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     @@ commit-graph.c: struct write_commit_graph_context {
       
       	const struct split_commit_graph_opts *split_opts;
       	size_t total_bloom_filter_data_size;
     -+	struct bloom_filter_settings bloom_settings;
     ++	const struct bloom_filter_settings *bloom_settings;
       };
       
       static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
      -	hashwrite_be32(f, settings->hash_version);
      -	hashwrite_be32(f, settings->num_hashes);
      -	hashwrite_be32(f, settings->bits_per_entry);
     -+	hashwrite_be32(f, ctx->bloom_settings.hash_version);
     -+	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
     -+	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
     ++	hashwrite_be32(f, ctx->bloom_settings->hash_version);
     ++	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
     ++	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
       
       	while (list < last) {
       		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	struct object_id file_hash;
       	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
       
     -+	ctx->bloom_settings = bloom_settings;
     ++	ctx->bloom_settings = &bloom_settings;
      +
       	if (ctx->split) {
       		struct strbuf tmp_file = STRBUF_INIT;
  7:  60bbc15d24 =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  -:  ---------- >  3:  492deaf916 bloom: get_bloom_filter() cleanups
  8:  db5b8fe843 !  4:  8727b25468 commit-graph: persist existence of changed-paths
     @@ Commit message
          property of "my commit-graph has changed-path filters" automatically. A
          user can drop filters using the --no-changed-paths option.
      
     +    In the process, we need to be extremely careful to match the Bloom
     +    filter settings as specified by the commit-graph. This will allow future
     +    versions of Git to customize these settings, and the version with this
     +    change will persist those settings as commit-graphs are rewritten on
     +    top.
     +
     +    Use the trace2 API to signal the settings used during the write, and
     +    check that output in a test after manually adjusting the correct bytes
     +    in the commit-graph file.
     +
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
       ## Documentation/git-commit-graph.txt ##
     @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
       
      
       ## commit-graph.c ##
     +@@
     + #include "progress.h"
     + #include "bloom.h"
     + #include "commit-slab.h"
     ++#include "json-writer.h"
     ++#include "trace2.h"
     + 
     + void git_test_write_commit_graph_or_die(void)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
     + 	stop_progress(&progress);
     + }
     + 
     ++static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     ++{
     ++	struct json_writer jw = JSON_WRITER_INIT;
     ++
     ++	jw_object_begin(&jw, 0);
     ++	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
     ++	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
     ++	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
     ++	jw_end(&jw);
     ++
     ++	trace2_data_json("bloom", ctx->r, "settings", &jw);
     ++
     ++	jw_release(&jw);
     ++}
     ++
     + static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 					 struct write_commit_graph_context *ctx)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 	struct progress *progress = NULL;
     + 	int i = 0;
     + 
     ++	trace2_bloom_filter_settings(ctx);
     ++
     + 	if (ctx->report_progress)
     + 		progress = start_delayed_progress(
     + 			_("Writing changed paths Bloom filters data"),
     +@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     + 	struct object_id file_hash;
     + 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     + 
     +-	ctx->bloom_settings = &bloom_settings;
     ++	if (!ctx->bloom_settings)
     ++		ctx->bloom_settings = &bloom_settings;
     + 
     + 	if (ctx->split) {
     + 		struct strbuf tmp_file = STRBUF_INIT;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
      +	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
      +		ctx->changed_paths = 1;
     -+	else if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++		struct commit_graph *g;
      +		prepare_commit_graph_one(ctx->r, ctx->odb);
      +
     ++		g = ctx->r->objects->commit_graph;
     ++
      +		/* We have changed-paths already. Keep them in the next graph */
     -+		if (ctx->r->objects->commit_graph &&
     -+		    ctx->r->objects->commit_graph->chunk_bloom_data)
     ++		if (g && g->chunk_bloom_data) {
      +			ctx->changed_paths = 1;
     ++			ctx->bloom_settings = g->bloom_filter_settings;
     ++		}
      +	}
      +
       	if (ctx->split) {
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain
       	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
       '
       
     +@@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in the latest but not all c
     + 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
     + '
     + 
     ++BASE_BDAT_OFFSET=2240
     ++BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
     ++BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
     ++
     ++corrupt_graph() {
     ++	pos=$1
     ++	data="${2:-\0}"
     ++	grepstr=$3
     ++	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
     ++	zero_pos=${4:-${orig_size}} &&
     ++	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
     ++	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
     ++}
     ++
     ++test_expect_success 'persist filter settings' '
     ++	test_when_finished rm -rf .git/objects/info/commit-graph* &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
     ++	cp .git/objects/info/commit-graph commit-graph-before &&
     ++	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
     ++	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
     ++	cp .git/objects/info/commit-graph commit-graph-after &&
     ++	test_commit c18 A/corrupt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
     ++'
     ++
     + test_done
     + \ No newline at end of file
  2:  65eb15221c !  5:  244668fec4 commit-graph: unify the signatures of all write_graph_chunk_*() functions
     @@ Commit message
      
       ## commit-graph.c ##
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct bloom_filter_settings bloom_settings;
     + 	const struct bloom_filter_settings *bloom_settings;
       };
       
      -static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
      +	return 0;
       }
       
     + static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     +@@ commit-graph.c: static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     + 	jw_release(&jw);
     + }
     + 
      -static void write_graph_chunk_bloom_data(struct hashfile *f,
      -					 struct write_commit_graph_context *ctx)
      +static int write_graph_chunk_bloom_data(struct hashfile *f,
  3:  3d24b9802d =  6:  8b959f2f37 commit-graph: simplify chunk writes into loop
  4:  bdca834e6d =  7:  3eb10933dc commit-graph: check chunk sizes after writing
  -:  ---------- >  8:  0bcfc1f051 revision.c: fix whitespace
  -:  ---------- >  9:  719c7091a7 revision: empty pathspecs should not use Bloom filters
  5:  9975fc96f1 ! 10:  9c2076b4ce commit-graph: check all leading directories in changed path Bloom filters
     @@ Commit message
          This adjusts the tracing values in t4216-log-bloom.sh, which provides a
          concrete way to notice the improvement.
      
     +    Helped-by: Taylor Blau <me@xxxxxxxxxxxx>
     +    Helped-by: René Scharfe <l.s.r@xxxxxx>
          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
       	int last_index;
      -	int len;
      +	size_t len;
     -+	int path_component_nr = 0, j;
     ++	int path_component_nr = 1;
       
       	if (!revs->commits)
       		return;
      @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
     - 
     - 	len = strlen(path);
     + 		return;
     + 	}
       
      -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
      -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
      +	p = path;
     -+	do {
     -+		p = strchrnul(p + 1, '/');
     -+		path_component_nr++;
     -+	} while (p - path < len);
     ++	while (*p) {
     ++		if (is_dir_sep(*p))
     ++			path_component_nr++;
     ++		p++;
     ++	}
      +
      +	revs->bloom_keys_nr = path_component_nr;
      +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
      +
     -+	p = path;
     -+	for (j = 0; j < revs->bloom_keys_nr; j++) {
     -+		p = strchrnul(p + 1, '/');
     ++	fill_bloom_key(path, len, &revs->bloom_keys[0],
     ++		       revs->bloom_filter_settings);
     ++	path_component_nr = 1;
      +
     -+		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
     -+			       revs->bloom_filter_settings);
     ++	p = path + len - 1;
     ++	while (p > path) {
     ++		if (is_dir_sep(*p))
     ++			fill_bloom_key(path, p - path,
     ++				       &revs->bloom_keys[path_component_nr++],
     ++				       revs->bloom_filter_settings);
     ++		p--;
      +	}
       
       	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
  6:  2a5f1e1752 = 11:  1022c0ad21 bloom: enforce a minimum size of 8 bytes

-- 
gitgitgadget



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux