[PATCH v2 00/11] [RFC] Commit-graph: Write incremental files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. This version takes the design
suggestions from the earlier discussion and tries to work out most of the
concerns.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder with name graph-{hash}.graph. The
list of hashes is given by the commit-graph-chain file, and also in a "base
graph chunk" in the commit-graph format. As we read a chain, we can verify
that the hashes match the trailing hash of each commit-graph we read along
the way and each hash below a level is expected by that graph file.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

git commit-graph write --stdin-commits:         8m 12s
git commit-graph write --stdin-commits --split:    48s

The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.

Here are a few points that still need to be addressed before this is ready
for full review:

 * The merge strategy values should be extracted into config options.
   
   
 * If we have a commit-graph chain and someone writes without "--split" it
   will make a new commit-graph file and not clean up the old files.
   
   
 * We need to update 'git commit-graph verify' to understand the chains, and
   test that it catches the new problems. It would be good to have a
   '--shallow' option to only verify the tip file, as if we run that after
   every write we can have some confidence that the files at rest are still
   valid and we only need to check the smaller file. (This is the main
   reason this is a priority to the VFS for Git team.)
   
   

This is based on ds/commit-graph-write-refactor.

Thanks, -Stolee

[1] 
https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (11):
  commit-graph: document commit-graph chains
  commit-graph: prepare for commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: load commit-graph chains
  commit-graph: add base graphs chunk
  commit-graph: rearrange chunk count logic
  commit-graph: write commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: merge commit-graph chains
  commit-graph: allow cross-alternate chains
  commit-graph: expire commit-graph files

 .../technical/commit-graph-format.txt         |  11 +-
 Documentation/technical/commit-graph.txt      | 195 +++++
 builtin/commit-graph.c                        |  10 +-
 commit-graph.c                                | 734 +++++++++++++++++-
 commit-graph.h                                |   7 +
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5323-split-commit-graph.sh                 | 172 ++++
 7 files changed, 1088 insertions(+), 43 deletions(-)
 create mode 100755 t/t5323-split-commit-graph.sh


base-commit: 8520d7fc7c6edd4d71582c69a873436029b6cb1b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Range-diff vs v1:

  1:  0be7713a25 <  -:  ---------- commit-graph: fix the_repository reference
  2:  86b18f15ba <  -:  ---------- commit-graph: return with errors during write
  3:  9299a7fe25 <  -:  ---------- commit-graph: collapse parameters into flags
  4:  f9b719fc7a <  -:  ---------- commit-graph: remove Future Work section
  5:  74e40970e0 <  -:  ---------- commit-graph: create write_commit_graph_context
  6:  54817ef50b <  -:  ---------- commit-graph: extract fill_oids_from_packs()
  7:  cf792d38ed <  -:  ---------- commit-graph: extract fill_oids_from_commit_hex()
  8:  aaae85f1ec <  -:  ---------- commit-graph: extract fill_oids_from_all_packs()
  9:  9d434dc38c <  -:  ---------- commit-graph: extract count_distinct_commits()
 10:  ebd665468e <  -:  ---------- commit-graph: extract copy_oids_to_commits()
 11:  3eee3667cf <  -:  ---------- commit-graph: extract write_commit_graph_file()
 12:  7bbe8d9150 <  -:  ---------- Documentation: describe split commit-graphs
  -:  ---------- >  1:  a423afbfdd commit-graph: document commit-graph chains
 13:  9d0e966a3d !  2:  249668fc92 commit-graph: lay groundwork for incremental files
     @@ -1,6 +1,26 @@
      Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     -    commit-graph: lay groundwork for incremental files
     +    commit-graph: prepare for commit-graph chains
     +
     +    To prepare for a chain of commit-graph files, augment the
     +    commit_graph struct to point to a base commit_graph. As we load
     +    commits from the graph, we may actually want to read from a base
     +    file according to the graph position.
     +
     +    The "graph position" of a commit is given by concatenating the
     +    lexicographic commit orders from each of the commit-graph files in
     +    the chain. This means that we must distinguish two values:
     +
     +     * lexicographic index : the position within the lexicographic
     +       order in a single commit-graph file.
     +
     +     * graph position: the posiiton within the concatenated order
     +       of multiple commit-graph files
     +
     +    Given the lexicographic index of a commit in a graph, we can
     +    compute the graph position by adding the number of commits in
     +    the lower-level graphs. To find the lexicographic index of
     +    a commit, we subtract the number of commits in lower-level graphs.
      
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     @@ -13,21 +33,21 @@
       
      +static void load_oid_from_graph(struct commit_graph *g, int pos, struct object_id *oid)
      +{
     ++	uint32_t lex_index;
     ++
      +	if (!g)
      +		BUG("NULL commit-graph");
      +
     -+	if (pos < g->num_commits_in_base) {
     -+		load_oid_from_graph(g->base_graph, pos, oid);
     -+		return;
     -+	}
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	if (pos >= g->num_commits + g->num_commits_in_base)
      +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
      +		    pos, g->num_commits, g->num_commits_in_base);
      +
     -+	pos -= g->num_commits_in_base;
     ++	lex_index = pos - g->num_commits_in_base;
      +
     -+	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * pos);
     ++	hashcpy(oid->hash, g->chunk_oid_lookup + g->hash_len * lex_index);
      +}
      +
       static struct commit_list **insert_parent_or_die(struct repository *r,
     @@ -47,24 +67,32 @@
       	if (!c)
       		die(_("could not find commit %s"), oid_to_hex(&oid));
      @@
     + 
       static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
       {
     - 	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
     --	item->graph_pos = pos;
     -+	item->graph_pos = pos + g->num_commits_in_base;
     +-	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
     ++	const unsigned char *commit_data;
     ++	uint32_t lex_index;
     ++
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
     ++
     ++	lex_index = pos - g->num_commits_in_base;
     ++	commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
     + 	item->graph_pos = pos;
       	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
       }
     - 
      @@
       	uint32_t *parent_data_ptr;
       	uint64_t date_low, date_high;
       	struct commit_list **pptr;
      -	const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
      +	const unsigned char *commit_data;
     ++	uint32_t lex_index;
       
      -	item->object.parsed = 1;
     -+	if (pos < g->num_commits_in_base)
     -+		return fill_commit_in_graph(r, item, g->base_graph, pos);
     ++	while (pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	if (pos >= g->num_commits + g->num_commits_in_base)
      +		BUG("position %d is beyond the scope of this commit-graph (%d local + %d base commits)",
     @@ -75,9 +103,9 @@
      +	 * "local" position for the rest of the calculation.
      +	 */
       	item->graph_pos = pos;
     -+	pos -= g->num_commits_in_base;
     ++	lex_index = pos - g->num_commits_in_base;
      +
     -+	commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos;
     ++	commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
      +
      +	item->object.parsed = 1;
       
     @@ -89,13 +117,13 @@
       	} else {
      -		return bsearch_graph(g, &(item->object.oid), pos);
      +		struct commit_graph *cur_g = g;
     -+		uint32_t pos_in_g;
     ++		uint32_t lex_index;
      +
     -+		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &pos_in_g))
     ++		while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
      +			cur_g = cur_g->base_graph;
      +
      +		if (cur_g) {
     -+			*pos = pos_in_g + cur_g->num_commits_in_base;
     ++			*pos = lex_index + cur_g->num_commits_in_base;
      +			return 1;
      +		}
      +
     @@ -111,8 +139,8 @@
      -					   GRAPH_DATA_WIDTH * (c->graph_pos);
      +	const unsigned char *commit_data;
      +
     -+	if (c->graph_pos < g->num_commits_in_base)
     -+		return load_tree_for_commit(r, g->base_graph, c);
     ++	while (c->graph_pos < g->num_commits_in_base)
     ++		g = g->base_graph;
      +
      +	commit_data = g->chunk_commit_data +
      +			GRAPH_DATA_WIDTH * (c->graph_pos - g->num_commits_in_base);
     @@ -133,8 +161,3 @@
       	const uint32_t *chunk_oid_fanout;
       	const unsigned char *chunk_oid_lookup;
       	const unsigned char *chunk_commit_data;
     - 	const unsigned char *chunk_extra_edges;
     -+	const unsigned char *chunk_base_graph;
     - };
     - 
     - struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
 14:  4436c8f4f1 <  -:  ---------- commit-graph: load split commit-graph files
 15:  aa4a096813 <  -:  ---------- commit-graph: write split commit-graph files
  -:  ---------- >  3:  809fa7ad80 commit-graph: rename commit_compare to oid_compare
  -:  ---------- >  4:  a8c0b47c8a commit-graph: load commit-graph chains
  -:  ---------- >  5:  4fefd0a654 commit-graph: add base graphs chunk
  -:  ---------- >  6:  a595a1eb65 commit-graph: rearrange chunk count logic
  -:  ---------- >  7:  9cbfb656b3 commit-graph: write commit-graph chains
 16:  7c5bc06d14 !  8:  5ad14f574b commit-graph: add --split option
     @@ -1,6 +1,16 @@
      Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     -    commit-graph: add --split option
     +    commit-graph: add --split option to builtin
     +
     +    Add a new "--split" option to the 'git commit-graph write' subcommand. This
     +    option allows the optional behavior of writing a commit-graph chain.
     +
     +    The current behavior will add a tip commit-graph containing any commits that
     +    are not in the existing commit-graph or commit-graph chain. Later changes
     +    will allow merging the chain and expiring out-dated files.
     +
     +    Add a new test script (t5323-split-commit-graph.sh) that demonstrates this
     +    behavior.
      
          Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
      
     @@ -55,39 +65,130 @@
       	read_replace_refs = 0;
       
      
     - diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
     - --- a/t/t5318-commit-graph.sh
     - +++ b/t/t5318-commit-graph.sh
     + diff --git a/t/t5323-split-commit-graph.sh b/t/t5323-split-commit-graph.sh
     + new file mode 100755
     + --- /dev/null
     + +++ b/t/t5323-split-commit-graph.sh
      @@
     - 	test_cmp_bin commit-graph-after-gc $objdir/info/commit-graph
     - '
     - 
     -+test_expect_success 'write split commit-graph' '
     -+	cd "$TRASH_DIRECTORY" &&
     -+	git clone full split &&
     -+	cd split &&
     ++#!/bin/sh
     ++
     ++test_description='split commit graph'
     ++. ./test-lib.sh
     ++
     ++GIT_TEST_COMMIT_GRAPH=0
     ++
     ++test_expect_success 'setup repo' '
     ++	git init &&
      +	git config core.commitGraph true &&
     -+	for i in $(test_seq 1 20); do
     -+		test_commit padding-$i
     ++	infodir=".git/objects/info" &&
     ++	graphdir="$infodir/commit-graphs" &&
     ++	test_oid_init
     ++'
     ++
     ++graph_read_expect() {
     ++	NUM_BASE=0
     ++	if test ! -z $2
     ++	then
     ++		NUM_BASE=$2
     ++	fi
     ++	cat >expect <<- EOF
     ++	header: 43475048 1 1 3 $NUM_BASE
     ++	num_commits: $1
     ++	chunks: oid_fanout oid_lookup commit_metadata
     ++	EOF
     ++	git commit-graph read >output &&
     ++	test_cmp expect output
     ++}
     ++
     ++test_expect_success 'create commits and write commit-graph' '
     ++	for i in $(test_seq 3)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
      +	done &&
      +	git commit-graph write --reachable &&
     -+	test_commit split-commit &&
     -+	git branch -f split-commit &&
     -+	git commit-graph write --reachable --split &&
     -+	test_path_is_file .git/objects/info/commit-graphs/commit-graph-1
     ++	test_path_is_file $infodir/commit-graph &&
     ++	graph_read_expect 3
      +'
      +
     -+graph_git_behavior 'split graph, split-commit vs merge 1' bare split-commit merge/1
     ++graph_git_two_modes() {
     ++	git -c core.commitGraph=true $1 >output
     ++	git -c core.commitGraph=false $1 >expect
     ++	test_cmp expect output
     ++}
     ++
     ++graph_git_behavior() {
     ++	MSG=$1
     ++	BRANCH=$2
     ++	COMPARE=$3
     ++	test_expect_success "check normal git operations: $MSG" '
     ++		graph_git_two_modes "log --oneline $BRANCH" &&
     ++		graph_git_two_modes "log --topo-order $BRANCH" &&
     ++		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
     ++		graph_git_two_modes "branch -vv" &&
     ++		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
     ++	'
     ++}
      +
     -+test_expect_success 'collapse split commit-graph' '
     -+	cd "$TRASH_DIRECTORY/split" &&
     ++graph_git_behavior 'graph exists' commits/3 commits/1
     ++
     ++verify_chain_files_exist() {
     ++	for hash in $(cat $1/commit-graph-chain)
     ++	do
     ++		test_path_is_file $1/graph-$hash.graph
     ++	done
     ++}
     ++
     ++test_expect_success 'add more commits, and write a new base graph' '
     ++	git reset --hard commits/1 &&
     ++	for i in $(test_seq 4 5)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
     ++	done &&
     ++	git reset --hard commits/2 &&
     ++	for i in $(test_seq 6 10)
     ++	do
     ++		test_commit $i &&
     ++		git branch commits/$i
     ++	done &&
     ++	git reset --hard commits/2 &&
     ++	git merge commits/4 &&
     ++	git branch merge/1 &&
     ++	git reset --hard commits/4 &&
     ++	git merge commits/6 &&
     ++	git branch merge/2 &&
      +	git commit-graph write --reachable &&
     -+	test_path_is_missing .git/objects/info/commit-graphs/commit-graph-1 &&
     -+	test_path_is_file .git/objects/info/commit-graph
     ++	graph_read_expect 12
     ++'
     ++
     ++test_expect_success 'add three more commits, write a tip graph' '
     ++	git reset --hard commits/3 &&
     ++	git merge merge/1 &&
     ++	git merge commits/5 &&
     ++	git merge merge/2 &&
     ++	git branch merge/3 &&
     ++	git commit-graph write --reachable --split &&
     ++	test_path_is_missing $infodir/commit-graph &&
     ++	test_path_is_file $graphdir/commit-graph-chain &&
     ++	ls $graphdir/graph-*.graph >graph-files &&
     ++	test_line_count = 2 graph-files &&
     ++	verify_chain_files_exist $graphdir
     ++'
     ++
     ++graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
     ++
     ++test_expect_success 'add one commit, write a tip graph' '
     ++	test_commit 11 &&
     ++	git branch commits/11 &&
     ++	git commit-graph write --reachable --split &&
     ++	test_path_is_missing $infodir/commit-graph &&
     ++	test_path_is_file $graphdir/commit-graph-chain &&
     ++	ls $graphdir/graph-*.graph >graph-files &&
     ++	test_line_count = 3 graph-files &&
     ++	verify_chain_files_exist $graphdir
      +'
      +
     -+graph_git_behavior 'collapsed graph, split-commit vs merge 1' bare split-commit merge/1
     ++graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
      +
     - test_expect_success 'replace-objects invalidates commit-graph' '
     - 	cd "$TRASH_DIRECTORY" &&
     - 	test_when_finished rm -rf replace &&
     ++test_done
 17:  3c52385e56 <  -:  ---------- fetch: add fetch.writeCommitGraph config setting
  -:  ---------- >  9:  9567daa0b8 commit-graph: merge commit-graph chains
  -:  ---------- > 10:  4cfe19a933 commit-graph: allow cross-alternate chains
  -:  ---------- > 11:  72fc0a1f17 commit-graph: expire commit-graph files

-- 
gitgitgadget



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux