Re: [PATCH v4 06/10] commit-graph: implement corrected commit date

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

All right.  We might want to say that it fulfills the same reachability
criteria as topological level, but perhaps this level of detail is not
necessary here.

> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).

I'm not sure if this special case is really necessary, but it makes for
cleaner reasoning.

> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
>
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
>
> However, using offsets be problematic if one of commits is malformed but
> valid and has committerdate of 0 Unix time, as the offset would be the
> same as corrected commit date and thus require 64-bits to be stored
> properly.
>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

All right.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx>

Somewhere in the commit message we should also describe that this commit
changes how commit-graph is verified: from checking that the generation
number agrees with _topological level definition_, that is that for a
given commit it is 1 more than maximum of its parents (with the caveat
that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
checking that slightly weaker condition fulfilled by both topological
levels (generation number v1) and by corrected commit date (generation
number v2) that for a given commit its generation number is 1 more than
maximum of its parents or larger.

But, as far as I understand it, current code does not handle correctly
GENERATION_NUMBER_V1_MAX case (if we use generation number v1).

On the other hand we could have simpy use functional check, that
generation number used (which can be v1 or v2, or any similar other)
fulfills the reachability condition for each edge, which can be
simplified to checking that generation(parents) <= generation(commit).
If the reachability condition is true for each edge, then it is true for
each path, and for each commit.

> ---
>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>  1 file changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cedd311024..03948adfce 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>  
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

Why this change?  It is not described in the commit message.

Note that while this tie-breaking fallback doesn't make much sense for
corrected committer date generation number v2, this tie-breaking helps
if we have to use topological levels (generation number v2).

>  	return 0;
>  }
>  
> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
enough; no need for 'timestamp_t' for 'level' variable.

> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>

All right, we compute both generation numbers: topological levels and
corrected commit date.

I guess we use 'corrected_commit_date' instead of simply 'generation' to
make it asier to remember which is which.

>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO

Straightforward addition.

> +		    )

Why this closing parenthesis is now in separated line?

>  			continue;
>  
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;

All right, straightforward addition.

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> -

Why we have removed this empty line?

> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;

All right.

>  				if (level == GENERATION_NUMBER_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {

All right, same as above.

>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}

All right, reasonable and straightforward.

>  			}
>  
> @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date && current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;

All right.

Here we use the same trick as in previous commit (and as above) to avoid
any possible overflow, to minimize number of conditionals.  The fact
that max_corrected_commit_date might store incorrect value doesn't
matter, as it is reset at beginning of this loop.

>  			}
>  		}
>  	}
> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>  
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
> -			max_generation--;
> -

Perhaps in the future we should check that both topological levels, and
also corrected committer date (if it exists) for correctness according
to their definition.  Then the above removed part would be restored (but
with s/max_generation/max_level/).

>  		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		if (generation < max_generation + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

All right, so we relaxed the check so that it will be fulfilled by
generation number v2 (and also by generation number v1, as it implies
the more strict check for v1).

What would happen however if generation holds topological levels, and it
is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
so the `git commit-graph verify` would incorrectly say that there is
a problem with generation number, while there isn't one (false positive
detection of error).

Sidenote: I think we don't have to worry about having to introduce
GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
disck representation) corrected commiter date is the same as of commiter
date itself, plus some, and I don't see us coming close to 64-bit limit
of timestamp_t for commit dates.

>  				     oid_to_hex(&cur_oid),
>  				     generation,
>  				     max_generation + 1);

Best,
-- 
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux