Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Aug 21, 2020 at 08:43:38PM +0200, Jakub Narębski wrote:
> Hello,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> >
> > As we are writing topological levels to commit data chunk to ensure
> > backwards compatibility with "Old" Git and the member `generation` of
> > struct commit_graph_data will store corrected commit date in a later
> > commit, let's introduce a commit-slab to store topological levels while
> > writing commit-graph.
> 
> In my opinion the above it would be easier to follow if rephrased in the
> following way:
> 
>   In a later commit we will introduce corrected commit date as the
>   generation number v2.  This value will be stored in the new separate
>   GDAT chunk.  However to ensure backwards compatibility with "Old" Git
>   we need to continue to write generation number v1, which is
>   topological level, to the commit data chunk (CDAT).  This means that
>   we need to compute both versions of generation numbers when writing
>   the commit-graph file.  Let's therefore introduce a commit-slab
>   to store topological levels; corrected commit date will be stored
>   in the member `generation` of struct commit_graph_data.
> 
> What do you think?
> 

Yes, that's better.

> 
> By the way, do I understand it correctly that in backward-compatibility
> mode (that is, in mixed-version environment where at least some
> commit-graph files were written by "Old" Git and are lacking GDAT chunk
> and generation number v2 data) the `generation` member of commit graph
> data chunk will be populated and will store generation number v1, that
> is topological level? And that the commit-slab for topological levels is
> only there for writing and re-writing?
> 

No, the topo_levels commit-slab would be always populated when we write
a commit data chunk. The topo_level slab is a workaround for the fact
that we are trying to write two independent values (corrected commit
date offset, topological levels) but have one struct member to store them in
(data->generation).

If we are in a mixed-version environment, we could avoid initializing
the slab and fill the topological levels into data->generation instead,
but that's not how it is implemented right now.

> >
> > When Git creates a split commit-graph, it takes advantage of the
> > generation values that have been computed already and present in
> > existing commit-graph files.
> >
> > So, let's add a pointer to struct commit_graph to the topological level
> > commit-slab and populate it with topological levels while writing a
> > split commit-graph.
> 
> All right, looks sensible.

I have extend the last paragraph to include write_commit_graph_context
as well as:

  So, let's add a pointer to struct commit_graph as well as struct
  write_commit_graph_context to the topological level commit-slab and
  populate it with topological levels while writing a commit-graph file.

> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> > ---
> >  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
> >  commit-graph.h |  1 +
> >  commit.h       |  1 +
> >  3 files changed, 34 insertions(+), 15 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 7f9f858577..a2f15b2825 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
> >  /* Remember to update object flag allocation in object.h */
> >  #define REACHABLE       (1u<<15)
> >
> > +define_commit_slab(topo_level_slab, uint32_t);
> > +
> 
> All right.
> 
> Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
> think it would be necessary.
> 
> >  /* Keep track of the order in which commits are added to our list. */
> >  define_commit_slab(commit_pos, int);
> >  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> > @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> >  	item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> >  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > +
> > +	if (g->topo_levels)
> > +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> >  }
> 
> All right, here we store topological levels on commit-slab to avoid
> recomputing them.
> 
> Do I understand it correctly that the `topo_levels` member of the `struct
> commit_graph` would be non-null only when we are updating the
> commit-graph?
> 

Yes, that's correct.

> >
> >  static inline void set_commit_tree(struct commit *c, struct tree *t)
> > @@ -953,6 +958,7 @@ struct write_commit_graph_context {
> >  		 changed_paths:1,
> >  		 order_by_pack:1;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	const struct split_commit_graph_opts *split_opts;
> >  	size_t total_bloom_filter_data_size;
> >  	const struct bloom_filter_settings *bloom_settings;
> 
> Why do we need `topo_levels` member *both* in `struct commit_graph` and
> in `struct write_commit_graph_context`?
> 
> [After examining the change further I have realized why both are needed,
>  and written about the reasoning later in this email.]
> 
> 
> Note that the commit message talks only about `struct commit_graph`...
> 
> > @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
> >  		else
> >  			packedDate[0] = 0;
> >
> > -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> > +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
> 
> All right, here we prepare for writing to the CDAT chunk using data that
> is now stored on newly introduced topo_levels slab (either computed, or
> taken from commit-graph file being rewritten).
> 
> Assuming that ctx->topo_levels is not-null, and that the values are
> properly calculated before this -- and we did compute topological levels
> before writing the commit-graph.
> 
> >
> >  		packedDate[1] = htonl((*list)->date);
> >  		hashwrite(f, packedDate, 8);
> > @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					_("Computing commit graph generation numbers"),
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> > -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> > +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> All right, so that is why this 'generation' variable was not converted
> to timestamp_t type.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> > -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> > -		    generation != GENERATION_NUMBER_ZERO)
> > +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > +		    level != GENERATION_NUMBER_ZERO)
> >  			continue;
> 
> Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
> outside commit-graph files, and therefore we would need its topological
> level computed.
> 
> However, I don't understand how it works.  We have had created the
> commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
> to provide default values for `struct commit_graph`... but only for
> `graph_pos` member.  It is commit_graph_generation() that returns
> GENERATION_NUMBER_INFINITY for commits not in graph.
> 
> But neither commit_graph_data_at()->generation nor topo_level_slab_at()
> handles this special case, so I don't see how 'generation' variable can
> *ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
> GENERATION_NUMBER_V1_INFINITY for commits not in graph.
> 
> Does it work *accidentally*, because the default value for uninitialized
> data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
> certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
> of commit-graph feature development history, namely the short time where
> Git didn't use any generation numbers and stored 0 in the place set for
> it in the commit-graph format...  On the other hand this is not the case
> for corrected commit date (generation number v2), as it could
> "legitimately" be 0 if some root commit (without any parents) had
> committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
> caused by malformed but valid commit object.
> 
> Ugh...

It works accidentally.

Our decision to avoid the cost of initializing both
commit_graph_data->generation and commit_graph_data->graph_pos has
led to some unwieldy choices - the complexity of helper functions,
bypassing helper functions when writing a commit-graph file [1].

I want to re-visit how commit_graph_data slab is defined in a future series.

[1]: https://lore.kernel.org/git/be28ab7b-0ae4-2cc5-7f2b-92075de3723a@xxxxxxxxx/

> 
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit *current = list->item;
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> > -			uint32_t max_generation = 0;
> > +			uint32_t max_level = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> > -				generation = commit_graph_data_at(parent->item)->generation;
> > +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> >
> > -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> > -				    generation == GENERATION_NUMBER_ZERO) {
> > +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > +				    level == GENERATION_NUMBER_ZERO) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (generation > max_generation) {
> > -					max_generation = generation;
> > +				} else if (level > max_level) {
> > +					max_level = level;
> >  				}
> >  			}
> 
> This is the same case as for previous chunk; see the comment above.
> 
> This code checks if parents have generation number / topological level
> computed, and tracks maximum value of it among all parents.
> 
> >
> >  			if (all_parents_computed) {
> > -				struct commit_graph_data *data = commit_graph_data_at(current);
> > -
> > -				data->generation = max_generation + 1;
> >  				pop_commit(&list);
> >
> > -				if (data->generation > GENERATION_NUMBER_MAX)
> > -					data->generation = GENERATION_NUMBER_MAX;
> > +				if (max_level > GENERATION_NUMBER_MAX - 1)
> > +					max_level = GENERATION_NUMBER_MAX - 1;
> > +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> 
> OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
> this value can be maximum value that can be safely stored in a given
> data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
> value that can be safely stored in uint32_t, so generation+1 had no
> chance to overflow.  This is no longer the case; the reorganization done
> here leads to more defensive code (safer).
> 
> All good.  However I think that we should clamp the value of topological
> level to the maximum value that can be safely stored *on disk*, in the
> 30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
> the code to write topological level would get more complicated.
> 
> In my opinion the symbolic constant used here should be named
> GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
> it should be the current value of GENERATION_NUMBER_MAX, that is
> 0x3FFFFFFF.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	uint32_t i, count_distinct = 0;
> >  	int res = 0;
> >  	int replace = 0;
> > +	struct topo_level_slab topo_levels;
> >
> 
> All right, we will be using topo_level slab for writing the
> commit-graph, and only for this purpose, so it is good to put it here.
> 
> >  	if (!commit_graph_compatible(the_repository))
> >  		return 0;
> > @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
> >  		}
> >  	}
> >
> > +	init_topo_level_slab(&topo_levels);
> > +	ctx->topo_levels = &topo_levels;
> > +
> > +	if (ctx->r->objects->commit_graph) {
> > +		struct commit_graph *g = ctx->r->objects->commit_graph;
> > +
> > +		while (g) {
> > +			g->topo_levels = &topo_levels;
> > +			g = g->base_graph;
> > +		}
> > +	}
> 
> All right, now I see why we need `topo_levels` member both in the
> `struct write_commit_graph_context` and in `struct commit_graph`.
> The former is for functions that write the commit-graph, the latter for
> fill_commit_graph_info() functions that is deep in the callstack, but it
> needs to know whether to load topological level to commit-slab, or maybe
> put it as generation number (and in the future -- discard it, if not
> needed).
> 
> 
> Sidenote: this fragment of code, that fills with a given value some
> member of the `struct commit_graph` throughout the split commit-graph
> chain, will be repeated as similar code in patches later in series.
> However without resorting to preprocessor macros I have no idea how to
> generalize it to avoid code duplication (well, almost).
> 

The pattern is: iterate over the commit-graph chain and assign a member
(here, topo_level and in the other patch, read_generation_data) a value
(the address of topo_level slab, 1/0 depending on whether it is a mixed
generation chain).

We could generalize this in a future series but I don't think it is
worthwhile.

> > +
> >  	if (pack_indexes) {
> >  		ctx->order_by_pack = 1;
> >  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 430bc830bb..1152a9642e 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -72,6 +72,7 @@ struct commit_graph {
> >  	const unsigned char *chunk_bloom_indexes;
> >  	const unsigned char *chunk_bloom_data;
> >
> > +	struct topo_level_slab *topo_levels;
> >  	struct bloom_filter_settings *bloom_filter_settings;
> >  };
> 
> All right: `struct commit_graph` is public, `struct
> write_commit_graph_context` is not.
> 
> >
> > diff --git a/commit.h b/commit.h
> > index bc0732a4fe..bb846e0025 100644
> > --- a/commit.h
> > +++ b/commit.h
> > @@ -15,6 +15,7 @@
> >  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
> >  #define GENERATION_NUMBER_MAX 0x3FFFFFFF
> 
> The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
> GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.
> 
> >  #define GENERATION_NUMBER_ZERO 0
> > +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF
> 
> This value is never used, so why it is defined in this commit.
> 

Moved down to the patch actually uses it.

> >
> >  struct commit_list {
> >  	struct commit *item;
> 
> Best,
> -- 
> Jakub Narębski



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux