[RFC][GSoC] Implement Generation Number v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello everyone,

Following the discussions between Jakub and Stolee [16], it was decided that
implementing "Corrected Commit Date With Strictly Monotonic Offset" [17] was
more appropriate for a GSoC project.

I would like to work on it. The proposal could still use some work but I
thought it would be best to get early feedback.

Having been over most of the related codebase, I think the project can be
reasonably completed in 8 weeks or so.

Transitioning from SHA1 to SHA3-256 (or at least adding support for it) has
been on the waitlist for a while. Is that something we could pursue as a
part of this project?

If not, are there any other related components/ideas that I could work on.

Thank you Jakub for proposing this project and the other contributors for
continued interest and discussions.

Note: You can alternatively read this proposal on Github [18].

[16]: https://lore.kernel.org/git/86eeto5x1j.fsf@xxxxxxxxx/
[17]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@xxxxxxxxx/
[18]: https://github.com/abhishekkumar2718/GSoC20/blob/master/generation_number_v2.md

===============================================================================

# Implement Generation Number v2

## Abstract

Git uses various clever methods for making operations on very large repositories
faster, from bitmap indices for git fetch [1], to generation numbers (also known
as topological levels) in the commit-graph file for commit graph traversal
operations like git log --graph [2].

However, generation numbers do not always improve performance. Stolee has
previously explored alternatives for a better generation number [3]. Backward
compatible corrected commit date was chosen because of its performance, local
computability, and backward compatibility.

[1]: https://githubengineering.com/counting-objects/
[2]: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
[3]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/

## Goals

This project aims to:
- Move `generation`, `graph_pos` members of struct commit into a commit slab.
- Update generation number to v2, modifying all places that reference or
compute generation numbers.
- Use generation numbers to speed up implementation of `--ancestry-path` and
possibly other commands.
- Extend Stolee`s performance test suite.
- Examine performance improvements.

## Overview

Git walks the commit graph for many reasons, including listing and filtering
commit history and computing merge bases.

These operations can become slow as the commit count grows. The two main costs
here are decompressing and parsing commits and walking the entire graphs to
satisfy topological order constraints.

Stolee introduced commit-graph file to accelerate this process. The commit-graph
file stores commit graph structure along extra metadata to speed up graph walks.
A chunk includes commit OID, list of parents, commit date, root tree OID and 
generation number.

Generation number v1 was defined as one more than the length of the longest path
from A to the root commit. Then,

    If A and B are commits with generation numbers N and M, and N <= M, then
    A cannot reach B.

This property reduces the time to walk commits and determine topological
relationships significantly.

### Performance Regression

While testing git v2.19.0-rc1, Stolee came across a performance regression due
to using topological levels [4]. The discussion can be summarized as:

Topological level changes the order in which the commits are inspected and
gives a performance regression if it chooses to walk a longer path than
compared to date-based heuristic.

For example, consider the performance before and after using topological levels
on `git merge-base v4.8 v4.9`:

  before topological levels: 0.122s
   after topological levels: 0.547s


The topology of this can be described as:
```
   v4.9
    |  \
    |   \
   v4.8  \
    | \   \
    |  \   |
   ...  A  B
    |  /  /
    | /  /
    |/__/
    C
```

Here, the "..." means a "very long line of commits". A and B have generation
one more than C.

If we order commits by generation, then commit date, B is removed from the queue
only after many commits reachable from v4.8 have been explored.  However, if
ordering by commit date removes A and B from the queue early and C is inserted.
As C is processed, the rest of queue is marked as STALE and loop terminates.

[4]: https://lore.kernel.org/git/pull.28.git.gitgitgadget@xxxxxxxxx/

### Alternatives to Topological Level

To improve generation numbers, Stolee investigated other alternative
definitions for generation numbers. The extended exploration and report are
available here [5].

Corrected commit date was chosen as the next generation number by overall
consensus because of its performance, local computability, and immutability.

However, it was not backward compatible with current implementation i.e., older
clients do not report correct values if they ignore reachability index version.
Existing clients would fail altogether with commit-graph v2 instead of
switching commit-graph feature off, as discovered by Ævar Arnfjörð Bjarmason [6].

[5]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/
[6]: https://lore.kernel.org/git/87a7gdspo4.fsf@xxxxxxxxxxxxxxxxxxx/

### Corrected Commit Date With Strictly Monotonic Offset

Jakub later proposed a modification to corrected commit date, _Corrected Commit
Date with Strictly Monotonic Offset_ defined as follows [7]:

For a commit C, let its _offset commit date_ (denoted by odate(C))
be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
such that:

1. Offset commit date is greater than the maximum of the commit date of
   C and the offset commit dates of its parents.

    If odate(A) < odate(B), then A cannot reach B.

2. The offset of a commit is one more than the maximum offset of a parent, or
   more.

    If offset(A) < offset(B), then A cannot reach B.

Since the offset are strictly greater than the offset of a parent, the old
clients give correct values for the odate as well. `git commit-graph verify`
would fail since the offsets are not generation numbers in all cases.

This is an acceptable trade off because users can re-write the commit graph
after verify fails.
 
A rough implementation by Jakub can be read here [8].

[7]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@xxxxxxxxx/
[8]: https://github.com/derrickstolee/git/pull/10

## Plan

### Community bonding period (April 27 - May 18, 4 weeks)

1. Extend Stolee`s performance test suite.

To investigate replacements for generation numbers, Stolee implemented
prototypes and compared performance on a range of queries.

It currently tests:
- `git log --topo-order -N`: measures commits walked to determine all possible
ways of arriving at given commit.
- `git log --topo-order -10 A..B`: measures commits walked to determine commits
reachable from B but not from A.
- `git merge-base A B`: measures commits walked to find the lowest (closest)
ancestor for both A and B.

We can also add tests for:
- `git branch --[no-]contains A`: measures commits walked to determine if A is
reachable from the tip of a given branch for all branches (many to one query).
- `git branch --[no-]merged A`: measures commits walked to determine if the tip
of a given branch is reachable from A for all branches (one to many query).
- `git push/fetch`: measures commits walked to compute HAVES/WANTS of a client
(many to many query) [9].

2. Find other operations that can be optimized using generation numbers.

As Jakub pointed out in the discussion on generation numbers, --ancestry-path
can be optimized using generation numbers [10]. Discuss and find any other
operation which could use generation numbers.

3. Explore commit, commit-graph, commit-reach, revision

[9]: https://github.blog/2015-09-22-counting-objects/
[10]: https://lore.kernel.org/git/861s82wdbp.fsf@xxxxxxxxx/

### Move `generation`, `graph_pos` into a slab (May 18 - June 8, 3 weeks)

struct commit is used in many contexts. However, the struct members
`generation` and `graph_pos` are only used for commit-graph related operations
and otherwise waste memory.

Using a commit slab reduces space of an individual commit object, which becomes
crucial as we load thousands of them. We can dynamically associate commits
with their generation/graph positions when needed.

Commit slab for `generation` is 64 bits wide as it stores odate instead
of offset [11]. Commit slab for `graph_pos` is simpler, with a straightforward
allocation of 32 bits.

Note: Commit-graph chunks would still store offset and will be used to compute
corrected date on initialization.

[11]: https://lore.kernel.org/git/cfa2c367-5cd7-add5-0293-caa75b103f34@xxxxxxxxx/

### Implement backward compatible corrected commit date (June 8 - July 20, 6 weeks)

1. Implement backward compatible corrected commit date (June 8 - June 22, 2 weeks)

Topological levels are computed using a DFS in `compute_generation_numbers` of
commit-graph.c. The condition

  generation(C) = max(generation(P)) + 1,
  where C is a commit and P are its parent(s).

needs to be redefined as:

  generation(C) = max(max(generation(P)), correction(C)) + 1,
  where correction(C) is max(cdate(P) - cdate(C), 0). 

Definitions of `GENERATION_NUMBER_{INFINITY, MAX, ZERO}` remain the same.

The functions `compare_commits_by_gen_then_commit_date` and
`compare_commits_by_commit_date` can be replaced by `compare_commits_by_gen` as
all three *should* return the same order.

--> First evaluation (June 15 - June 19)

2. Update existing references (June 22 - July 20, 4 weeks)

The following functions use generation number or committer date for cutoffs:
- `contains_tag_algo`
- `can_all_from_reach_with_flag`
- `paint_down_to_common`
- `remove_redundant`

All of these can be updated with little to no changes.

--> End Semester Exams (June 29 - July 4)

--> Second evaluation (July 13 - July 17)

### Extend generation number to --ancestry-path (July 20 - August 3, 2 weeks)

`--ancestry-path A..B` returns the commits that are both descendants of A and
ancestors of B. The algorithm computes commits that are reachable by B but not
by A, reverse the graph and walk backwards from A [12].

We can use generation number to limit walks to find commits reachable by B.

[12]: https://lore.kernel.org/git/0c6b42e4-e825-ff70-a528-e118abf4c435@xxxxxxxxx/

### Examine performance improvements (August 3 - August 10, 1 week)

- Compare the performance of generation number v2 versus topological level using
  performance test suite.
- Wrap up any existing patches.

The performance report would be a nice summary my work over the summer.

## Contributions

[Microproject] Consolidate test_cmp_graph logic
-----
Log graph comparison logic is duplicated many times. This patch consolidates
comparison and sanitation logic in lib-log-graph.

Status: Merged

Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@xxxxxxxxx/

Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737

I have also reviewed patches and discussed queries with other contributors:
- https://lore.kernel.org/git/CAHk66fskrfcJ0YFDhfimVBTJZB4um7r=GdQuM8heJdZtF8D7UQ@xxxxxxxxxxxxxx/
- https://lore.kernel.org/git/CAHk66fvt-1RaLK8E7SDpocWM9OMAcA-gP5hjHq6r5N_FbATNgA@xxxxxxxxxxxxxx/
- https://github.com/git/git/pull/647#issuecomment-591978405

and others.

## Post GSoC

I would love to keep contributing to git after the GSoC period ends. There`s so
much to learn from the community.

Hannes`s comment on checks as a penalty that should be paid only by constant
strbufs was a perspective I had not considered [13].

Interacting with Kyagi made me rethink the justifications _emphasizing commit
messages_. I was at my wit`s end, which makes me appreciate my patient mentors
more and want to give back to the community.

[13]: https://lore.kernel.org/git/467c035f-c7cd-01e1-e64c-2c915610de01@xxxxxxxx/

## About Me

I am Abhishek Kumar, a second-year CSE student at National Institute of
Technology Karnataka, India. I have a blog where I talk about my interests -
programming, fiction, and literature [14].

I primarily work with C/C++ and Ruby on Rails. I am a member of my institute`s
Open Source Club and student-built University Management System, _IRIS_. I have
some experience of mentoring - Creating their code style guide and being an
active reviewer [15].

[14]: https://abhishekkumar2718.github.io/

[15]: https://iris.nitk.ac.in/about_us

## Availablity

The official GSoC coding period runs from April 27 to August 17.

Due to the outbreak of COVID-19 in my country, my college has pre-emptively
announced summer vacations from March 17 to June 1. Unfortunately,
I would have classes for a large part of the coding period. However, I can still
contribute 35-40 hours every week due to a low course load (~20 hours a week).

I would not be able to contribute from June 29 to July 4 due to end semester exams.
It would be easily compensated during the subsequent week "semester break" from
July 5 to July 27.

## Contact Information

```
     Name: Abhishek Kumar
    Email: abhishekkumar8222@xxxxxxxxx
    Major: Computer Science And Engineering
   Degree: Bachelor of Technology
Institute: National Institute of Technology Karnataka
   Github: abhishekkumar2718
 Timezone: UTC+5:30 (IST)
```

Thank you for taking the time to review my proposal!



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux