Re: [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> · Wed, 26 Aug 2020 12:45:19 +0530

On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
> Hi Abhishek,
> 
> ... 
> 
> However the commit message do not say anything about the *writing* side.
> 

Revised the commit message to include the following at the end:

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite does not
have GDAT chunks either.

> 
> ...
> 
> To be more detailed, without '--split=replace' we would want the following
> layer merging behavior:
> 
>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>            1              2           3             4            5
> 
> In the split commit-graph chain above, merging two topmost layers
> (layers 4 and 5) should create a layer without GDAT; merging three
> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
> 4) should create a new layer with GDAT.
> 
>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>            1              2           3               merged
> 
>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>            1              2                    merged
> 
> I hope those ASCII-art pictures help understanding it
> 

Thanks! There were helpful.

While we work as expected in the first scenario i.e merging 4 and 5, we
would *still* write a layer without GDAT in the second scenario.

I have tweaked split_graph_merge_strategy() to fix this:

----------------------------------------------

diff --git a/commit-graph.c b/commit-graph.c
index 6d54d9a286..246fad030d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)

----------------------------------------------------

That is, if we were not writing generation data (because of mixed
generation concerns) but the new topmost layer has a generation data
chunk, we have merged all layers without GDAT chunk and can now write a
GDAT chunk safely.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> Wouldn't breaking the reachability condition promise make some Git
> commands to return *incorrect* results if they short-circuit, stop
> walking if generation number shows that A cannot reach B?
> 
> I am talking here about commands that return boolean, or select subset
> from given set of revisions:
> - git merge-base --is-ancestor <B> <A>
> - git branch branch-A <A> && git branch --contains <B>
> - git branch branch-B <B> && git branch --merged <A>
> 
> Git assumes that generation numbers fulfill the following condition:
> 
>   if A can reach B, then gen(A) > gen(B)
> 
> Notably this includes commits not in commit-graph, and clamped values.
> 
> However, in the following case
> 
> * if commit A is from higher layer without GDAT
>   and uses topological levels for 'generation', e.g. 115 (in a small repo)
> * and commit B is from lower layer with GDAT
>   and uses corrected commit date as 'generation', for example 1598112896,
> 
> it may happen that A (later commit) can reach B (earlier commit), but
> gen(B) > gen(A).  The reachability condition promise for generation
> numbers is broken.
> 
> >
> > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> > ---
> 
> I have reordered files in the patch itself to make it easier to review
> the proposed changes.
> 
> >  commit-graph.h                |  1 +
> >  commit-graph.c                | 32 +++++++++++++++-
> >  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/commit-graph.h b/commit-graph.h
> > index f78c892fc0..3cf89d895d 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -63,6 +63,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >
> >  	uint32_t num_commits_in_base;
> > +	uint32_t read_generation_data;
> >  	struct commit_graph *base_graph;
> >
> 
> First, why `read_generation_data` is of uint32_t type, when it stores
> (as far as I understand it), a "boolean" value of either 0 or 1?
> 

Yes, using unsigned int instead of uint32_t (although in most of cases
it would be same).  If commit_graph had other flags as well, we could
have used a bit field.

> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
> that interfere with the case of rewriting, where we want to use existing
> GDAT data when writing new commit-graph with GDAT chunk?

It interferes with rewriting the split commit-graph, as you might have
guessed from the above code snippet.

> 
> ...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> 
> >  		graph_data->generation = item->date +
> >  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> >  	else
> > @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
> >  	uint32_t pos;
> >  	if (!prepare_commit_graph(r))
> >  		return;
> > +
> >  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> >  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> >  }
> 
> This is unrelated whitespace fix, a "while at it" in neighbourhood of
> changes.  All right then.
> 

Reverted this change, as it's unimportant.

> > @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
> ...
> 
> It would be nice to have an example with merging layers (whether we
> would handle it in strict or relaxed way).
> 

Sure, will add.

> > +
> >  test_done
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek