Re: [PATCH v3 06/11] commit-graph: add a slab to store topological levels

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
>
> As we are writing topological levels to commit data chunk to ensure
> backwards compatibility with "Old" Git and the member `generation` of
> struct commit_graph_data will store corrected commit date in a later
> commit, let's introduce a commit-slab to store topological levels while
> writing commit-graph.

In my opinion the above it would be easier to follow if rephrased in the
following way:

  In a later commit we will introduce corrected commit date as the
  generation number v2.  This value will be stored in the new separate
  GDAT chunk.  However to ensure backwards compatibility with "Old" Git
  we need to continue to write generation number v1, which is
  topological level, to the commit data chunk (CDAT).  This means that
  we need to compute both versions of generation numbers when writing
  the commit-graph file.  Let's therefore introduce a commit-slab
  to store topological levels; corrected commit date will be stored
  in the member `generation` of struct commit_graph_data.

What do you think?


By the way, do I understand it correctly that in backward-compatibility
mode (that is, in mixed-version environment where at least some
commit-graph files were written by "Old" Git and are lacking GDAT chunk
and generation number v2 data) the `generation` member of commit graph
data chunk will be populated and will store generation number v1, that
is topological level? And that the commit-slab for topological levels is
only there for writing and re-writing?

>
> When Git creates a split commit-graph, it takes advantage of the
> generation values that have been computed already and present in
> existing commit-graph files.
>
> So, let's add a pointer to struct commit_graph to the topological level
> commit-slab and populate it with topological levels while writing a
> split commit-graph.

All right, looks sensible.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
> ---
>  commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
>  commit-graph.h |  1 +
>  commit.h       |  1 +
>  3 files changed, 34 insertions(+), 15 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 7f9f858577..a2f15b2825 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>
> +define_commit_slab(topo_level_slab, uint32_t);
> +

All right.

Also, here we might need GENERATION_NUMBER_V1_INFINITY, but I don't
think it would be necessary.

>  /* Keep track of the order in which commits are added to our list. */
>  define_commit_slab(commit_pos, int);
>  static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> @@ -759,6 +761,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
>  	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
> +	if (g->topo_levels)
> +		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
>  }

All right, here we store topological levels on commit-slab to avoid
recomputing them.

Do I understand it correctly that the `topo_levels` member of the `struct
commit_graph` would be non-null only when we are updating the
commit-graph?

>
>  static inline void set_commit_tree(struct commit *c, struct tree *t)
> @@ -953,6 +958,7 @@ struct write_commit_graph_context {
>  		 changed_paths:1,
>  		 order_by_pack:1;
>
> +	struct topo_level_slab *topo_levels;
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
>  	const struct bloom_filter_settings *bloom_settings;

Why do we need `topo_levels` member *both* in `struct commit_graph` and
in `struct write_commit_graph_context`?

[After examining the change further I have realized why both are needed,
 and written about the reasoning later in this email.]


Note that the commit message talks only about `struct commit_graph`...

> @@ -1094,7 +1100,7 @@ static int write_graph_chunk_data(struct hashfile *f,
>  		else
>  			packedDate[0] = 0;
>
> -		packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> +		packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);

All right, here we prepare for writing to the CDAT chunk using data that
is now stored on newly introduced topo_levels slab (either computed, or
taken from commit-graph file being rewritten).

Assuming that ctx->topo_levels is not-null, and that the values are
properly calculated before this -- and we did compute topological levels
before writing the commit-graph.

>
>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
> @@ -1335,11 +1341,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> +		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

All right, so that is why this 'generation' variable was not converted
to timestamp_t type.

>
>  		display_progress(ctx->progress, i + 1);
> -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
> -		    generation != GENERATION_NUMBER_ZERO)
> +		if (level != GENERATION_NUMBER_V1_INFINITY &&
> +		    level != GENERATION_NUMBER_ZERO)
>  			continue;

Here we use GENERATION_NUMBER*_INFINITY to check if the commit is
outside commit-graph files, and therefore we would need its topological
level computed.

However, I don't understand how it works.  We have had created the
commit_graph_data_at() and use it instead of commit_graph_data_slab_at()
to provide default values for `struct commit_graph`... but only for
`graph_pos` member.  It is commit_graph_generation() that returns
GENERATION_NUMBER_INFINITY for commits not in graph.

But neither commit_graph_data_at()->generation nor topo_level_slab_at()
handles this special case, so I don't see how 'generation' variable can
*ever* be GENERATION_NUMBER_INFINITY, and 'level' variable can ever be
GENERATION_NUMBER_V1_INFINITY for commits not in graph.

Does it work *accidentally*, because the default value for uninitialized
data on commit-slab is 0, which matches GENERATION_NUMBER_ZERO?  It
certainly looks like it does.  And GENERATION_NUMBER_ZERO is an artifact
of commit-graph feature development history, namely the short time where
Git didn't use any generation numbers and stored 0 in the place set for
it in the commit-graph format...  On the other hand this is not the case
for corrected commit date (generation number v2), as it could
"legitimately" be 0 if some root commit (without any parents) had
committerdate of epoch 0, i.e. 1 January 1970 00:00:00 UTC, perhaps
caused by malformed but valid commit object.

Ugh...

>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1347,29 +1353,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit *current = list->item;
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
> -			uint32_t max_generation = 0;
> +			uint32_t max_level = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
> -				generation = commit_graph_data_at(parent->item)->generation;
> +				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>
> -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
> -				    generation == GENERATION_NUMBER_ZERO) {
> +				if (level == GENERATION_NUMBER_V1_INFINITY ||
> +				    level == GENERATION_NUMBER_ZERO) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (generation > max_generation) {
> -					max_generation = generation;
> +				} else if (level > max_level) {
> +					max_level = level;
>  				}
>  			}

This is the same case as for previous chunk; see the comment above.

This code checks if parents have generation number / topological level
computed, and tracks maximum value of it among all parents.

>
>  			if (all_parents_computed) {
> -				struct commit_graph_data *data = commit_graph_data_at(current);
> -
> -				data->generation = max_generation + 1;
>  				pop_commit(&list);
>
> -				if (data->generation > GENERATION_NUMBER_MAX)
> -					data->generation = GENERATION_NUMBER_MAX;
> +				if (max_level > GENERATION_NUMBER_MAX - 1)
> +					max_level = GENERATION_NUMBER_MAX - 1;
> +				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;

OK, this is safer way of handling GENERATION_NUMBER*_MAX, especially if
this value can be maximum value that can be safely stored in a given
data type.  Previously GENERATION_NUMBER_MAX was smaller than maximum
value that can be safely stored in uint32_t, so generation+1 had no
chance to overflow.  This is no longer the case; the reorganization done
here leads to more defensive code (safer).

All good.  However I think that we should clamp the value of topological
level to the maximum value that can be safely stored *on disk*, in the
30 bits of the CDAT chunk reserved for generation number v1.  Otherwise
the code to write topological level would get more complicated.

In my opinion the symbolic constant used here should be named
GENERATION_NUMBER_V1_MAX, and its value should be at most (2 ^ 30 - 1);
it should be the current value of GENERATION_NUMBER_MAX, that is
0x3FFFFFFF.

>  			}
>  		}
>  	}
> @@ -2101,6 +2105,7 @@ int write_commit_graph(struct object_directory *odb,
>  	uint32_t i, count_distinct = 0;
>  	int res = 0;
>  	int replace = 0;
> +	struct topo_level_slab topo_levels;
>

All right, we will be using topo_level slab for writing the
commit-graph, and only for this purpose, so it is good to put it here.

>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
> @@ -2179,6 +2184,18 @@ int write_commit_graph(struct object_directory *odb,
>  		}
>  	}
>
> +	init_topo_level_slab(&topo_levels);
> +	ctx->topo_levels = &topo_levels;
> +
> +	if (ctx->r->objects->commit_graph) {
> +		struct commit_graph *g = ctx->r->objects->commit_graph;
> +
> +		while (g) {
> +			g->topo_levels = &topo_levels;
> +			g = g->base_graph;
> +		}
> +	}

All right, now I see why we need `topo_levels` member both in the
`struct write_commit_graph_context` and in `struct commit_graph`.
The former is for functions that write the commit-graph, the latter for
fill_commit_graph_info() functions that is deep in the callstack, but it
needs to know whether to load topological level to commit-slab, or maybe
put it as generation number (and in the future -- discard it, if not
needed).


Sidenote: this fragment of code, that fills with a given value some
member of the `struct commit_graph` throughout the split commit-graph
chain, will be repeated as similar code in patches later in series.
However without resorting to preprocessor macros I have no idea how to
generalize it to avoid code duplication (well, almost).

> +
>  	if (pack_indexes) {
>  		ctx->order_by_pack = 1;
>  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
> diff --git a/commit-graph.h b/commit-graph.h
> index 430bc830bb..1152a9642e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -72,6 +72,7 @@ struct commit_graph {
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
>
> +	struct topo_level_slab *topo_levels;
>  	struct bloom_filter_settings *bloom_filter_settings;
>  };

All right: `struct commit_graph` is public, `struct
write_commit_graph_context` is not.

>
> diff --git a/commit.h b/commit.h
> index bc0732a4fe..bb846e0025 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -15,6 +15,7 @@
>  #define GENERATION_NUMBER_V1_INFINITY 0xFFFFFFFF
>  #define GENERATION_NUMBER_MAX 0x3FFFFFFF

The name GENERATION_NUMBER_MAX for 0x3FFFFFFF should be instead
GENERATION_NUMBER_V1_MAX, but that may be done in a later commit.

>  #define GENERATION_NUMBER_ZERO 0
> +#define GENERATION_NUMBER_V2_OFFSET_MAX 0xFFFFFFFF

This value is never used, so why it is defined in this commit.

>
>  struct commit_list {
>  	struct commit *item;

Best,
-- 
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux