Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

All right.

>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

I think this paragraphs tries too much to be concise, with the result it
is less clear than it could be.  Perhaps it would be better to separate
"what-if" from the current behavior.

  If each layer of split commit-graph is treated independently, as it
  were the case before this commit, with Git inspecting only the current
  layer for chunk_generation_data pointer, commits in the lower layer
  (one with GDAT) would have corrected commit date as their generation
  number, while commits in the upper layer would have topological levels
  as their generation.  Corrected commit dates have usually much larger
  values than topological levels.  This means that if we take two
  commits, one from the upper layer, and one reachable from it in the
  lower layer, then the expectation that the generation of a parent is
  smaller than the generation of a child would be violated.

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.

I don't quite understand the issue here. Unless none of the following
query commands short-circuit and all walk the commit graph regardless of
what generation numbers tell them, they should give different results
with and without the commit graph, if we take two commits one from lower
layer of split commit graph with GDAT, and one commit from the higher
layer without GDAT, one lower reachable from the other higher.

We have the following query commands that we can check:
  $ git merge-base --is-ancestor <lower> <higher>
  $ git merge-base --independent <lower> <higher>
  
  $ git tag --contains <tag-to-lower>
  $ git tag --merged <tag-to-higher>
  $ git branch --contains <branch-to-lower>
  $ git branch --merged <branch-to-higher>

The second set of queries require for those commits to be tagged, or
have branch pointing at them, respectively.

Also, shouldn't `git commit-graph verify` fail with split commit graph
where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?


Let's assume that we have the following history, with newer commits
shown on top like in `git log --graph --oneline --all`:

          topological     corrected         generation
          level           commit date       number^*

      d    3                                3
      |
   c  |    3                                3
   |  |                                                 without GDAT
 ..|..|.....[layer.boundary]........................................
   |  |                                                    with GDAT
   |  b    2              1112912113        1112912113
   |  |
   a  |    2              1112912053        1112912053
   | /
   |/
   r       1              1112911993        1112911993

*) each layer inspected individually.

With such history, we can for example reach 'a' from 'c', thus
`git merge-base --is-ancestor a b` should return true value, but
without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
I use here weaker reachability condition, but the one that works
also for commits outside the commit-graph (and those for which
generation numbers overflows).

>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

All right, that's good explanation.

>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
> Rewriting layers follows similar approach: if the topmost layer below
> the set of layers being rewritten (in the split commit-graph chain)
> exists, and it does not contain GDAT chunk, then the result of rewrite
> does not have GDAT chunks either.

All right, very good explanation; the only minor suggestion would be to
add some 'intro' to the first of those two paragraphs, for example:

  Therefore, when writing the new layer in split commit-graph...

Though I am not sure if it is necessary.

>
> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> ---
>  commit-graph.c                | 29 +++++++++++-
>  commit-graph.h                |  1 +
>  t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
>  3 files changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 71d0b243db..5d15a1399b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>  
> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> +	int read_generation_data;
> +
> +	if (!g)
> +		return;
> +
> +	read_generation_data = !!g->chunk_generation_data;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}
> +}

All right, this function checks assumedly topmost layer if it is
GDAT-less, and if it is propagates this status down the layers of split
commit graph.  This is needed because if we have mixed-generation commit
graph, then for each and every layer we need to use topological levels
as generation number.

The only minor issue is the name of this function (it does not hint that
it propagates the GDAT status downwards), but I don't have a better
idea, unfortunately.  And it does reflect what this function is used for.

> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>  
> +	validate_mixed_generation_chain(g);
> +

All right, this looks like a good place to put this new check: just
after reading commit-graph chain.

>  	return g;
>  }
>  
> @@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	if (g->chunk_generation_data) {
> +	if (g->chunk_generation_data && g->read_generation_data) {
>  		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

All right, instead of simply checking if the current layer has
generation data chunk, we need to also check if the whole graph allows
for it (if there are no mixed-generation layers).

The g->read_generation_data should be filled correctly, because
fill_commit_graph_info() is always preceded by read_commit_graph(), if I
understand it correctly.

>  
>  		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>  
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +

This needs more careful examination, and looking at larger context of
those lines.

At this point, unless `--split=replace` option is used, 'g' points to
the bottom layer out of all topmost layers being merged. We know that if
there are GDAT-less layers then these must be top layers, so this means
that we can write GDAT chunk in the result of the merge -- because we
would be replacing all possible GDAT-less layers (and maybe some with
GDAT) with a single layer with the GDAT chunk.

The ctx->write_generation_data is set to true unless environment
variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
write_commit_graph() it would be set to false if topmost layer doesn't
have GDAT chunk, and to true if `--split=replace` option is used; see
below.

Looks good to me.


NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
writing GDAT chunk with generation data v2 unless we are merging layers,
or replacing all of them with a single layer: then it is _ignored_.

Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
in t/README?  Currently it reads:

  GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
  commit-graph to be written without generation data chunk.

>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)
> @@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
>  		struct commit_graph *g = ctx->r->objects->commit_graph;
>  
>  		while (g) {
> +			g->read_generation_data = 1;
>  			g->topo_levels = &topo_levels;
>  			g = g->base_graph;
>  		}

All right, when writing the commit graph we want to make use of existing
generation data chunks.  This is safe, because when computing generation
numbers for writing we have separate place to store topoogical levels
(`topo_levels`) so they would not be mixed with corrected commit dates:
generation number v1 and v2 are kept separate.

> @@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +

All right, if current (topmost) layed does not have GDAT, then when
creating a new layer do not create GDAT layer either (merging layers and
rewriting the commit-graph is handled separately).

>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->opts)
>  			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;

All right, when replacing all layers (`git commit-graph write --split=replace`),
then we can safely write the GDAT chunk.

Note however that here we don't take into account the value of the
environment variable GIT_TEST_COMMIT_GRAPH_NO_GDAT.  Which maybe is what
we want...

>  	}
>  
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/commit-graph.h b/commit-graph.h
> index 19a02001fd..ad52130883 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -64,6 +64,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>  
>  	uint32_t num_commits_in_base;
> +	unsigned int read_generation_data;
>  	struct commit_graph *base_graph;

All right, this new field is here to propagate to each layer the
information whether we can read from the generation number v2 data
chunk.

Though I am not sure whether this field should be added here, and
whether it should be `unsigned int` (we don't have to be that careful
about saving space for this type).

>  
>  	const uint32_t *chunk_oid_fanout;
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 651df89ab2..d0949a9eb8 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
>  	verify_chain_files_exist $graphdir
>  '
>  
> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&

This should probably go just before cd-ing into just created
subdirectory.

> +	graphdir=".git/objects/info/commit-graphs" &&
> +	test_oid_cache <<-EOM &&
> +	oid_version sha1:1
> +	oid_version sha256:2
> +	EOM

Minor nitpick: Why use "EOM", which is used only twice in Git the test
suite, and not the conventional "EOF" (used at least 4000 times)?

> +	cd "$TRASH_DIRECTORY/mixed" &&

The t/README says:

   - Don't chdir around in tests.  It is not sufficient to chdir to
     somewhere and then chdir back to the original location later in
     the test, as any intermediate step can fail and abort the test,
     causing the next test to start in an unexpected directory.  Do so
     inside a subshell if necessary.

Though I am not sure if it should apply also to this situation.

> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&

All right.

> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&

Is there a reason why we do not check just written commit-graph file
with `test-tool read-graph >output-layer-1`?

> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&

Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
to use one or the other?

> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&

It would be nice to have ASCII-art of the history (of the graph of
revisions) created here for subsequent tests:

                                        
           /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
          /    \-\
         /        \
  1 <-- 2 <-- 3*   \--\
  |      \             \ 
  |       \-----\       \
   \             \       \
    \-- 4*<------ M/1     M/2
        |\               /  
        | \-- 5*        /
        \              /
         \------------/

  * - 1st layer  

Though I am not sure if what I have created is readable; I think a
better way to draw this graph is possible, for example:

               /- 3*
              /
             /
  1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
   \         \       \
    \         \       \
     \         \       \
      \- 4* <-- M/1     \    
         |\              \
         | \------------- M/2
         \
          \---- 5*

Edit: as I see the history gets even more complicated, so perhaps
ASCII-art diagram of the history with layers marked would be too
complicated, and wouldn't bring much.

Why do we need such shape of the history in the repository?

> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&

All right, we check that we have 2 commits, and that there is no GDAT
chunk.

> +	git commit-graph verify

All right, we verify commit-graph as a whole (both layers).

> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '

Hmmm... I wonder if we can come up with a better name for this test;
for example should it be "does not write" or "do not write"?

> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&

The commit graph gets complicated, so it would not be easy to visualize
it with ASCII-art diagram without any crossed lines.  Maybe `git log
--graph --oneline --all` would help:

*   (merge/3) Merge branch 'merge/2'
|\
| *   (merge/2) Merge branch 'commits/6'
| |\
* | \   Merge branch 'commits/5'
|\ \ \
| * | | (commits/5) 5
| |/ /
* | |   Merge branch 'merge/1'
|\ \ \
| * | | (merge/1) Merge branch 'commits/4'
| |\| |
| | * | (commits/4) 4
* | | | (commits/3) 3
|/ / /
| | | * (commits/10) 10
| | | * (commits/9) 9
| | | * (commits/8) 8
| | | * (commits/7) 7
| | |/
| | * (commits/6) 6
| |/
|/|
* | (commits/2) 2
|/
* (commits/1) 1


> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify

All right, so here we check that we have layer without GDAT at the top,
and we request not to merge layers thus new layer will be created, then
the new layer also does not have GDAT chunk (and has 3 commits).

Minor nitpick: shouldn't those test be indented?

> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&

All right, this checks that we have split commit-graph chain that
consist of a single layer, and that the commit-graph file for this
single layer exists.

> +	graph_read_expect 15 &&

Shouldn't we use `test-tool read-graph` to check whether generation_data
chunk is present... ah, sorry, I have realized that after previous
patches `graph_read_expect 15` implicitly checks the latter, because in
its' use of `test-tool read-graph` it does expect generation_data chunk.

So we use `test-tool read-graph` manually to check that generation_data
chunk is absent, and we use graph_read_expect to check that it is
present (and in both cases that the number of commits matches).  I
wonder if it would be possible to simplify that...

> +	git commit-graph verify

All right.

> +'
> +
> +test_expect_success 'add one commit, write a tip graph' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	test_commit 11 &&
> +	git branch commits/11 &&
> +	git commit-graph write --reachable --split &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'

What it is meant to test?  That adding single-commit to a 15 commit
commit-graph file in split mode does not result in layers merging, and
actually adds a new layer: we check that we have exactly two layers and
that they are all OK.

We don't check here that the newly created top layer commit-graph does
have GDAT chunk, as it should be if the top layer (in this case the only
layer) has GDAT chunk.

> +
>  test_done

One test we are missing is testing that merging layers is done
correctly, namely that if we are merging layers in split commit-graph
file, and the layer below the ones we are merging lacks GDAT chunk, then
the result of the merge should also be without GDAT chunk.  This would
require at least two GDAT-less layers in a setup.

I'm not sure how difficult writing such test should be.

Best,
-- 
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux