Re: [PATCH v4 10/10] commit-graph.txt: update design document

Jakub Narebski <jnareb@xxxxxxxxx> · Tue, 01 May 2018 01:32:46 +0200

Derrick Stolee <dstolee@xxxxxxxxxxxxx> writes:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().
>
> Expand the section on generation numbers to discuss how the three
> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
> _MAX interact with other generation numbers.
>
> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

Looks good.

> ---
>  Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 0550c6d0dc..d9f2713efa 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>  
> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +in the commit-graph file. If a commit-graph file was written by a version
> +of Git that did not compute generation numbers, then those commits will
> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> +
> +Since the commit-graph file is closed under reachability, we can guarantee
> +the following weaker condition on all commits:
> +
> +    If A and B are commits with generation numbers N amd M, respectively,
> +    and N < M, then A cannot reach B.
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits,

The linux kernel commit graph has maximum of 513 commits sharing the
same generation number, but is is 5.43 commits sharing the same
generation number on average, with standard deviation 10.70; median is
even lower: it is 2, with 5.35 median absolute deviation (MAD).

So on average it would be a few extra commits.  Right.

>                               but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.

As I wrote before, handling those corner cases in more complicated, but
not that complicated.  We could simply use stronger condition if both
generation numbers are ordinary generation numbers, and weaker condition
when at least one generation number has one of those special values.

> +
> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

Ordinary generation numbers, where stronger condition holds, are those
between GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.

> +
>  Design Details
>  --------------
>  
> @@ -98,17 +121,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>  
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -

Good.

>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:
>  
> -    - paint_down_to_common()
>      - 'log --topo-order'

Another possible candidates:

       - remove_redundant() - see comment in previous patch
       - still_interesting() - where Git uses date slop to stop walking
         too far

>  
>  - Currently, parse_commit_gently() requires filling in the root tree

One important issue left is handling features that change view of
project history, and their interaction with commit-graph feature.

What would happen, if we turn on commit-graph feature, generate commit
graph file, and then:

  * use graft file or remove graft entries to cut history, or remove cut
    or join two [independent] histories.
  * use git-replace mechanims to do the same
  * in shallow clone, deepen or shorten the clone

What would happen if without re-generating commit-graph file (assuming
tha Git wouldn't do it for us), we run some feature that makes use of
commit-graph data:

  - git branch --contains
  - git tag --contains
  - git rev-list A..B

Best,
-- 
Jakub Narębski