Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> · Thu, 12 Nov 2020 15:31:33 +0530

On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> > From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> >
> > Since there are released versions of Git that understand generation
> > numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> > chunk, the following scenario is possible:
> >
> > 1. "New" Git writes a commit-graph with the GDAT chunk.
> > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> All right.
> 
> >
> > Because of the current use of inspecting the current layer for a
> > chunk_generation_data pointer, the commits in the lower layer will be
> > interpreted as having very large generation values (commit date plus
> > offset) compared to the generation numbers in the top layer (topological
> > level). This violates the expectation that the generation of a parent is
> > strictly smaller than the generation of a child.
> 
> I think this paragraphs tries too much to be concise, with the result it
> is less clear than it could be.  Perhaps it would be better to separate
> "what-if" from the current behavior.
> 
>   If each layer of split commit-graph is treated independently, as it
>   were the case before this commit, with Git inspecting only the current
>   layer for chunk_generation_data pointer, commits in the lower layer
>   (one with GDAT) would have corrected commit date as their generation
>   number, while commits in the upper layer would have topological levels
>   as their generation.  Corrected commit dates have usually much larger
>   values than topological levels.  This means that if we take two
>   commits, one from the upper layer, and one reachable from it in the
>   lower layer, then the expectation that the generation of a parent is
>   smaller than the generation of a child would be violated.
> 

Thanks, that's better.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> 
> I don't quite understand the issue here. Unless none of the following
> query commands short-circuit and all walk the commit graph regardless of
> what generation numbers tell them, they should give different results
> with and without the commit graph, if we take two commits one from lower
> layer of split commit graph with GDAT, and one commit from the higher
> layer without GDAT, one lower reachable from the other higher.
> 
> We have the following query commands that we can check:
>   $ git merge-base --is-ancestor <lower> <higher>
>   $ git merge-base --independent <lower> <higher>
>   
>   $ git tag --contains <tag-to-lower>
>   $ git tag --merged <tag-to-higher>
>   $ git branch --contains <branch-to-lower>
>   $ git branch --merged <branch-to-higher>
> 
> The second set of queries require for those commits to be tagged, or
> have branch pointing at them, respectively.
> 
> Also, shouldn't `git commit-graph verify` fail with split commit graph
> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
> 
> Let's assume that we have the following history, with newer commits
> shown on top like in `git log --graph --oneline --all`:
> 
>           topological     corrected         generation
>           level           commit date       number^*
> 
>       d    3                                3
>       |
>    c  |    3                                3
>    |  |                                                 without GDAT
>  ..|..|.....[layer.boundary]........................................
>    |  |                                                    with GDAT
>    |  b    2              1112912113        1112912113
>    |  |
>    a  |    2              1112912053        1112912053
>    | /
>    |/
>    r       1              1112911993        1112911993
> 
> *) each layer inspected individually.
> 
> With such history, we can for example reach 'a' from 'c', thus
> `git merge-base --is-ancestor a b` should return true value, but
> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
> I use here weaker reachability condition, but the one that works
> also for commits outside the commit-graph (and those for which
> generation numbers overflows).
> 

The original explanation was given by Dr. Stolee and he might not have
thought exhaustively about the issue.

In any case, your explanation and the history make sense to me. I will
try to add test and report back to the mailing list if something goes
wrong.

Thank you for clarifying in such detail.

> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> All right, that's good explanation.
> 
> ...
>
> > @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
> >  		}
> >  	}
> >  
> > +	if (!ctx->write_generation_data && g->chunk_generation_data)
> > +		ctx->write_generation_data = 1;
> > +
> 
> This needs more careful examination, and looking at larger context of
> those lines.
> 
> At this point, unless `--split=replace` option is used, 'g' points to
> the bottom layer out of all topmost layers being merged. We know that if
> there are GDAT-less layers then these must be top layers, so this means
> that we can write GDAT chunk in the result of the merge -- because we
> would be replacing all possible GDAT-less layers (and maybe some with
> GDAT) with a single layer with the GDAT chunk.
> 
> The ctx->write_generation_data is set to true unless environment
> variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
> write_commit_graph() it would be set to false if topmost layer doesn't
> have GDAT chunk, and to true if `--split=replace` option is used; see
> below.
> 
> Looks good to me.
> 
> 
> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
> writing GDAT chunk with generation data v2 unless we are merging layers,
> or replacing all of them with a single layer: then it is _ignored_.
> 
> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
> in t/README?  Currently it reads:
> 
>   GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>   commit-graph to be written without generation data chunk.

I think it's better to *not* write generation data chunk if
GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
are merged, that is:

  if (!ctx->write_generation_data &&
      g->chunk_generation_data &&
     !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
    ctx->write_generation_data = 1;

With this change, we would have a method to force-write commit-graph
without generation data chunk regardless of the shape of split
commit-graph files.

> 
> ...
>
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 19a02001fd..ad52130883 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -64,6 +64,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >  
> >  	uint32_t num_commits_in_base;
> > +	unsigned int read_generation_data;
> >  	struct commit_graph *base_graph;
> 
> All right, this new field is here to propagate to each layer the
> information whether we can read from the generation number v2 data
> chunk.
> 
> Though I am not sure whether this field should be added here, and
> whether it should be `unsigned int` (we don't have to be that careful
> about saving space for this type).
> 

I cannot think of a more appropriate struct than `struct commit_graph`. 
Any particular suggestions?

> > 
> >  	const uint32_t *chunk_oid_fanout;
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 651df89ab2..d0949a9eb8 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
> >  	verify_chain_files_exist $graphdir
> >  '
> >  
> > +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> > +	mkdir mixed &&
> 
> This should probably go just before cd-ing into just created
> subdirectory.
> 
> > +	graphdir=".git/objects/info/commit-graphs" &&
> > +	test_oid_cache <<-EOM &&
> > +	oid_version sha1:1
> > +	oid_version sha256:2
> > +	EOM
> 
> Minor nitpick: Why use "EOM", which is used only twice in Git the test
> suite, and not the conventional "EOF" (used at least 4000 times)?

Right, both instances of "EOM" are actually my own. I looked up some
test script for oid cache that did use EOM when I first wrote the tests
but it's changed now. Will replace.
> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> 
> The t/README says:
> 
>    - Don't chdir around in tests.  It is not sufficient to chdir to
>      somewhere and then chdir back to the original location later in
>      the test, as any intermediate step can fail and abort the test,
>      causing the next test to start in an unexpected directory.  Do so
>      inside a subshell if necessary.
> 
> Though I am not sure if it should apply also to this situation.

While I cannot avoid changing directory, using a subshell would be best
to avoid causing the later tests to start in unexpected directories.

> 
> > +	git init &&
> > +	git config core.commitGraph true &&
> > +	git config gc.writeCommitGraph false &&
> 
> All right.
> 
> > +	for i in $(test_seq 3)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/1 &&
> > +	for i in $(test_seq 4 5)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/2 &&
> > +	for i in $(test_seq 6 10)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git commit-graph write --reachable --split &&
> 
> Is there a reason why we do not check just written commit-graph file
> with `test-tool read-graph >output-layer-1`?

We could check the written commit-graph file at this point but it's same
as existing tests as above.

> 
> > +	git reset --hard commits/2 &&
> > +	git merge commits/4 &&
> 
> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
> to use one or the other?

`test_merge` is used in 26 places whereas `git merge` is used in over a
thousand places. `test_merge` is just not widely adopted and this lack
of adoption prevents further use.

> 
> > +	git branch merge/1 &&
> > +	git reset --hard commits/4 &&
> > +	git merge commits/6 &&
> > +	git branch merge/2 &&
> 
> It would be nice to have ASCII-art of the history (of the graph of
> revisions) created here for subsequent tests:
> 
>                                         
>            /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>           /    \-\
>          /        \
>   1 <-- 2 <-- 3*   \--\
>   |      \             \ 
>   |       \-----\       \
>    \             \       \
>     \-- 4*<------ M/1     M/2
>         |\               /  
>         | \-- 5*        /
>         \              /
>          \------------/
> 
>   * - 1st layer  
> 
> Though I am not sure if what I have created is readable; I think a
> better way to draw this graph is possible, for example:
> 
>                /- 3*
>               /
>              /
>   1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>    \         \       \
>     \         \       \
>      \         \       \
>       \- 4* <-- M/1     \    
>          |\              \
>          | \------------- M/2
>          \
>           \---- 5*
> 
> Edit: as I see the history gets even more complicated, so perhaps
> ASCII-art diagram of the history with layers marked would be too
> complicated, and wouldn't bring much.
> 
> Why do we need such shape of the history in the repository?

We don't need such a complicated shape. Any commit-graph file with 2-3
layers regardless of how commits are related should suffice. Will
simplify.

> 
> > +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 1
> > +	num_commits: 2
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> 
> All right, we check that we have 2 commits, and that there is no GDAT
> chunk.
> 
> > +	git commit-graph verify
> 
> All right, we verify commit-graph as a whole (both layers).
> 
> > +'
> > +
> > +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> 
> Hmmm... I wonder if we can come up with a better name for this test;
> for example should it be "does not write" or "do not write"?

That's better.

> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git reset --hard commits/3 &&
> > +	git merge merge/1 &&
> > +	git merge commits/5 &&
> > +	git merge merge/2 &&
> > +	git branch merge/3 &&
> 
> The commit graph gets complicated, so it would not be easy to visualize
> it with ASCII-art diagram without any crossed lines.  Maybe `git log
> --graph --oneline --all` would help:
> 
> *   (merge/3) Merge branch 'merge/2'
> |\
> | *   (merge/2) Merge branch 'commits/6'
> | |\
> * | \   Merge branch 'commits/5'
> |\ \ \
> | * | | (commits/5) 5
> | |/ /
> * | |   Merge branch 'merge/1'
> |\ \ \
> | * | | (merge/1) Merge branch 'commits/4'
> | |\| |
> | | * | (commits/4) 4
> * | | | (commits/3) 3
> |/ / /
> | | | * (commits/10) 10
> | | | * (commits/9) 9
> | | | * (commits/8) 8
> | | | * (commits/7) 7
> | | |/
> | | * (commits/6) 6
> | |/
> |/|
> * | (commits/2) 2
> |/
> * (commits/1) 1
> 
> 
> > +	git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 2
> > +	num_commits: 3
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> > +	git commit-graph verify
> 
> All right, so here we check that we have layer without GDAT at the top,
> and we request not to merge layers thus new layer will be created, then
> the new layer also does not have GDAT chunk (and has 3 commits).
> 
> Minor nitpick: shouldn't those test be indented?
> 

The tests look indented to me and `git diff HEAD^ --check` gives nothing.

Did you mean the lines enclosed by EOF delimiter?

> > +'
> > +
> > +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git commit-graph write --reachable --split=replace &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	test_line_count = 1 $graphdir/commit-graph-chain &&
> > +	verify_chain_files_exist $graphdir &&
> 
> All right, this checks that we have split commit-graph chain that
> consist of a single layer, and that the commit-graph file for this
> single layer exists.
> 
> > +	graph_read_expect 15 &&
> 
> Shouldn't we use `test-tool read-graph` to check whether generation_data
> chunk is present... ah, sorry, I have realized that after previous
> patches `graph_read_expect 15` implicitly checks the latter, because in
> its' use of `test-tool read-graph` it does expect generation_data chunk.
> 
> So we use `test-tool read-graph` manually to check that generation_data
> chunk is absent, and we use graph_read_expect to check that it is
> present (and in both cases that the number of commits matches).  I
> wonder if it would be possible to simplify that...
>

The problem here is graph_read_expect() as defined in
t5324-split-commit-graph takes two parameters - number of commits and
number of base graphs. If the number of base graphs is not passed to
the function call, it's assumed to be zero. Using a default parameter
is tricky - I can fix it by manually adding a zero to each of 
graph_read_expect() in an additional preparatory patch.

Any other suggestions are welcome too.

> 
> > +	git commit-graph verify
> 
> All right.
> 
> > +'
> > +
> > +test_expect_success 'add one commit, write a tip graph' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	test_commit 11 &&
> > +	git branch commits/11 &&
> > +	git commit-graph write --reachable --split &&
> > +	test_path_is_missing $infodir/commit-graph &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	ls $graphdir/graph-*.graph >graph-files &&
> > +	test_line_count = 2 graph-files &&
> > +	verify_chain_files_exist $graphdir
> > +'
> 
> What it is meant to test?  That adding single-commit to a 15 commit
> commit-graph file in split mode does not result in layers merging, and
> actually adds a new layer: we check that we have exactly two layers and
> that they are all OK.

This test is meant to check writing to a split graph in "normal"
conditions (i.e. all existing layers have generation data chunk). The
above tests are special cases as they involve merging layers with mixed 
generation number versions.

> 
> We don't check here that the newly created top layer commit-graph does
> have GDAT chunk, as it should be if the top layer (in this case the only
> layer) has GDAT chunk.
> > +
> >  test_done
> 
> One test we are missing is testing that merging layers is done
> correctly, namely that if we are merging layers in split commit-graph
> file, and the layer below the ones we are merging lacks GDAT chunk, then
> the result of the merge should also be without GDAT chunk.  This would
> require at least two GDAT-less layers in a setup.
> 
> I'm not sure how difficult writing such test should be.

It wouldn't be too hard. 

After the last test, I can write some more commits and write split 
commit-graph file without GDAT chunk. Then write some more commits 
and merge layers using `git commit-graph write --max-commits=<nr>`.

Thanks for pointing this out!

> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek