[GSoC][Proposal v2] Implement Generation Number v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

Thanks to Jakub for the discussion on differentiating between generation
numbers and the review.

Changes in v2:
- Add section on distinguishing generation numbers v1 and v2.
- Update 'Plan' with new GSoC timeline.
- Fix incorrect attribution of devising backwards compatible corrected
  commit dates to Jakub.

You can alternatively read this with markdown rendered:
https://github.com/abhishekkumar2718/GSoC20/blob/master/generation_number_v2.md

Thanks
Abhishek
---

# Implement Generation Number v2

## Abstract

Git uses various clever methods for making operations on very large repositories
faster, from bitmap indices for git fetch [1], to generation numbers (also known
as topological levels) in the commit-graph file for commit graph traversal
operations like git log --graph [2].

However, generation numbers do not always improve performance. Stolee has
previously explored alternatives for a better generation number [3]. Backward
compatible corrected commit date was chosen because of its performance, local
computability, and backward compatibility.

[1]: https://githubengineering.com/counting-objects/
[2]: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
[3]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/

## Goals

This project aims to:
- Move `generation`, `graph_pos` members of struct commit into a commit slab.
- Examine methods to distinguish commit-graph with legacy generation number and
generation number v2.
- Update generation number to v2, modifying all places that reference or
compute generation numbers.
- Use generation numbers to speed up implementation of `--ancestry-path` and
possibly other commands.
- Extend Stolee`s performance test suite.
- Examine performance improvements.

## Overview

Git walks the commit graph for many reasons, including listing and filtering
commit history and computing merge bases.

These operations can become slow as the commit count grows. The two main costs
here are decompressing and parsing commits and walking the entire graphs to
satisfy topological order constraints.

Stolee introduced commit-graph file to accelerate this process. The commit-graph
file stores commit graph structure along extra metadata to speed up graph walks.
A chunk includes commit OID, list of parents, commit date, root tree OID and 
generation number.

Generation number v1 was defined as one more than the length of the longest path
from A to the root commit. Then,

    If A and B are commits with generation numbers N and M, and N <= M, then
    A cannot reach B.

This property reduces the time to walk commits and determine topological
relationships significantly.

### Performance Regression

While testing git v2.19.0-rc1, Stolee came across a performance regression due
to using topological levels [4]. The discussion can be summarized as:

Topological level changes the order in which the commits are inspected and
gives a performance regression if it chooses to walk a longer path than
compared to date-based heuristic.

For example, consider the performance before and after using topological levels
on `git merge-base v4.8 v4.9`:

  before topological levels: 0.122s
   after topological levels: 0.547s


The topology of this can be described as:
```
   v4.9
    |  \
    |   \
   v4.8  \
    | \   \
    |  \   |
   ...  A  B
    |  /  /
    | /  /
    |/__/
    C
```

Here, the "..." means a "very long line of commits". A and B have generation
one more than C.

If we order commits by generation, then commit date, B is removed from the queue
only after many commits reachable from v4.8 have been explored.  However, if
ordering by commit date removes A and B from the queue early and C is inserted.
As C is processed, the rest of queue is marked as STALE and loop terminates.

[4]: https://lore.kernel.org/git/pull.28.git.gitgitgadget@xxxxxxxxx/

### Alternatives to Topological Level

To improve generation numbers, Stolee investigated other alternative
definitions for generation numbers. The extended exploration and report are
available here [5].

Corrected commit date was chosen as the next generation number by overall
consensus because of its performance, local computability, and immutability.

However, it was not backward compatible with current implementation i.e., older
clients do not report correct values if they ignore reachability index version.
Existing clients would fail altogether with commit-graph v2 instead of
switching commit-graph feature off, as discovered by Ævar Arnfjörð Bjarmason [6].

[5]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/
[6]: https://lore.kernel.org/git/87a7gdspo4.fsf@xxxxxxxxxxxxxxxxxxx/

### Corrected Commit Date With Strictly Monotonic Offset

Corrected commit date was later modified to _Corrected Commit Date With
Strictly Monotonic Offset_, defined as follows [7]:

For a commit C, let its _offset commit date_ (denoted by odate(C))
be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
such that:

1. Offset commit date is greater than the maximum of the commit date of
   C and the offset commit dates of its parents.

    If odate(A) < odate(B), then A cannot reach B.

2. The offset of a commit is one more than the maximum offset of a parent, or
   more.

    If offset(A) < offset(B), then A cannot reach B.

Since the offset are strictly greater than the offset of a parent, the old
clients give correct values for the odate as well. `git commit-graph verify`
would fail since the offsets are not generation numbers in all cases.

This is an acceptable trade off because users can re-write the commit graph
after verify fails.
 
A rough implementation by Jakub can be read here [8].

[7]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@xxxxxxxxx/
[8]: https://github.com/derrickstolee/git/pull/10

## Distinguishing Generation Numbers

As explained above, commit-graph format version cannot be used to distinguish
between generation numbers. We want new Git to recognize if commit-graph uses
generation v1, use those generation numbers correctly and update commit-graph
file. The solutions suggested are:

- metadata / versioning chunk: Introduce an optional chunk to store generation
number as metadata and store offsets in CDAT. For a chain of commit-graph files
with both versions, best choice would be flatten the chain and recompute values
from scratch.

- new chunk for generation number: Introduce an optional chunk to store the
offset. Fill in CDAT with either generation zero or compute topological level.
This removes the restriction of storing generation number in CDAT. However,
since Git lazily parses commit data, seperating generation number from CDAT
would increase time spent in I/O.

Metadata chunk seems like the better solution for corrected commit date.
However, generation number chunk should be revisited if later versions of
generation numbers require more than 30 bits.

## Plan

### Community bonding period (May 4 - June 1)

1. Extend Stolee`s performance test suite.

To investigate replacements for generation numbers, Stolee implemented
prototypes and compared performance on a range of queries.

It currently tests:
- `git log --topo-order -N`: measures commits walked to determine all possible
ways of arriving at given commit.
- `git log --topo-order -10 A..B`: measures commits walked to determine commits
reachable from B but not from A.
- `git merge-base A B`: measures commits walked to find the lowest (closest)
ancestor for both A and B.

We can also add tests for:
- `git branch --[no-]contains A`: measures commits walked to determine if A is
reachable from the tip of a given branch for all branches (many to one query).
- `git branch --[no-]merged A`: measures commits walked to determine if the tip
of a given branch is reachable from A for all branches (one to many query).
- `git push/fetch`: measures commits walked to compute HAVES/WANTS of a client
(many to many query) [9].

2. Find other operations that can be optimized using generation numbers.

As Jakub pointed out in the discussion on generation numbers, --ancestry-path
can be optimized using generation numbers [10]. Discuss and find any other
operation which could use generation numbers.

3. Explore commit, commit-graph, commit-reach, revision

[9]: https://github.blog/2015-09-22-counting-objects/
[10]: https://lore.kernel.org/git/861s82wdbp.fsf@xxxxxxxxx/

### Move `generation`, `graph_pos` into a slab (June 1 - June 15, 2 weeks)

struct commit is used in many contexts. However, the struct members
`generation` and `graph_pos` are only used for commit-graph related operations
and otherwise waste memory.

Using a commit slab reduces space of an individual commit object, which becomes
crucial as we load thousands of them. We can dynamically associate commits
with their generation/graph positions when needed.

Commit slab for `generation` is 64 bits wide as it stores odate instead
of offset [11]. Commit slab for `graph_pos` is simpler, with a straightforward
allocation of 32 bits.

Note: Commit-graph chunks would still store offset and will be used to compute
corrected date on initialization.

[11]: https://lore.kernel.org/git/cfa2c367-5cd7-add5-0293-caa75b103f34@xxxxxxxxx/

### Implement backward compatible corrected commit date (June 15 - August 2, 7 weeks)

1. Implement metadata chunk (June 15 - June 29, 2 week)

2. Implement backward compatible corrected commit date (June 29 - July 13, 2 weeks)

Topological levels are computed using a DFS in `compute_generation_numbers` of
commit-graph.c. The condition

  generation(C) = max(generation(P)) + 1,
  where C is a commit and P are its parent(s).

needs to be redefined as:

  generation(C) = max(max(generation(P)), correction(C)) + 1,
  where correction(C) is max(cdate(P) - cdate(C), 0). 

Definitions of `GENERATION_NUMBER_{INFINITY, MAX, ZERO}` remain the same.

The functions `compare_commits_by_gen_then_commit_date` and
`compare_commits_by_commit_date` can be replaced by `compare_commits_by_gen` as
all three *should* return the same order.

--> First evaluation (June 29 - July 3)

--> End Semester Exams (June 29 - July 4)

3. Update existing references (July 13 - August 2 , 3 weeks)

The following functions use generation number or committer date for cutoffs:
- `contains_tag_algo`
- `can_all_from_reach_with_flag`
- `paint_down_to_common`
- `remove_redundant`

All of these can be updated with little to no changes.

--> Second evaluation (July 27 - July 31)

### Extend generation number to --ancestry-path (August 3 - August 16, 2 weeks)

`--ancestry-path A..B` returns the commits that are both descendants of A and
ancestors of B. The algorithm computes commits that are reachable by B but not
by A, reverse the graph and walk backwards from A [12].

We can use generation number to limit walks to find commits reachable by B.

[12]: https://lore.kernel.org/git/0c6b42e4-e825-ff70-a528-e118abf4c435@xxxxxxxxx/

### Examine performance improvements (August 17 - August 24, 1 week)

- Compare the performance of generation number v2 versus topological level using
  performance test suite.
- Wrap up any existing patches.

The performance report would be a nice summary my work over the summer.

## Contributions

[Microproject] Consolidate test_cmp_graph logic
-----
Log graph comparison logic is duplicated many times. This patch consolidates
comparison and sanitation logic in lib-log-graph.

Status: Merged

Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@xxxxxxxxx/

Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737

I have also reviewed patches and discussed queries with other contributors:
- https://lore.kernel.org/git/CAHk66fskrfcJ0YFDhfimVBTJZB4um7r=GdQuM8heJdZtF8D7UQ@xxxxxxxxxxxxxx/
- https://lore.kernel.org/git/CAHk66fvt-1RaLK8E7SDpocWM9OMAcA-gP5hjHq6r5N_FbATNgA@xxxxxxxxxxxxxx/
- https://github.com/git/git/pull/647#issuecomment-591978405

and others.

## Post GSoC

I would love to keep contributing to git after the GSoC period ends. There`s so
much to learn from the community.

Hannes`s comment on checks as a penalty that should be paid only by constant
strbufs was a perspective I had not considered [13].

Interacting with Kyagi made me rethink the justifications _emphasizing commit
messages_. I was at my wit`s end, which makes me appreciate my patient mentors
more and want to give back to the community.

[13]: https://lore.kernel.org/git/467c035f-c7cd-01e1-e64c-2c915610de01@xxxxxxxx/

## About Me

I am Abhishek Kumar, a second-year CSE student at National Institute of
Technology Karnataka, India. I have a blog where I talk about my interests -
programming, fiction, and literature [14].

I primarily work with C/C++ and Ruby on Rails. I am a member of my institute`s
Open Source Club and student-built University Management System, _IRIS_. I have
some experience of mentoring - Creating their code style guide and being an
active reviewer [15].

[14]: https://abhishekkumar2718.github.io/

[15]: https://iris.nitk.ac.in/about_us

## Availablity

The official GSoC coding period runs from June 1 to August 24.

Due to the outbreak of COVID-19 in my country, my college has pre-emptively
announced summer vacations from March 17 to June 1. Unfortunately,
I would have classes for a large part of the coding period. However, I can still
contribute 35-40 hours every week due to a low course load (~20 hours a week).

I would not be able to contribute from June 29 to July 4 due to end semester
exams.  It would be easily compensated during the subsequent "semester break"
from July 5 to July 27.

## Contact Information

```
     Name: Abhishek Kumar
    Email: abhishekkumar8222@xxxxxxxxx
    Major: Computer Science And Engineering
   Degree: Bachelor of Technology
Institute: National Institute of Technology Karnataka
   Github: abhishekkumar2718
 Timezone: UTC+5:30 (IST)
```

Thank you for taking the time to review my proposal!



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux