Re: Contributor Summit Topics and Logistics

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 29 Jan 2019 13:22:37 -0500

I was hoping to attend the contributors' summit remotely, but now my leave is
starting before then. This email contains a summary of what I would have
added to the discussion.

Thanks,
-Stolee

Commit-Graph Status Report
==========================

I'm really happy with the progress in this area, especially with the number of
other contributors working on the feature! Thanks Ævar, Jonathan, Josh, Stefan,
and Szeder in particular.

Here are some directions to take the feature in the near future:

File Format v2
--------------

The new format version [1] specifically fixes some shortcomings in v1:

* Uses the 4-byte format id for the hash algorithm.
* Creates a separate version byte for the reachability index.
* Enforces that the unused byte is zero until we use it for incremental writes.

Hopefully, this is the last time we need to update the file header.

[1] https://public-inbox.org/git/pull.112.git.gitgitgadget@xxxxxxxxx/
    [PATCH 0/6] Create commit-graph file format v2

Reachability Index
------------------

As discussed on-list [2], we want to replace generation numbers with a different
(negative-cut) reachability index. I used the term "corrected commit date". The
definition is:

* If a commit has no parents, then its corrected commit date is its commit date.

* If a commit has parents, then its corrected commit date is the maximum of:
    - its commit date
    - one more than the maximum corrected commit date of its parents

The benefits of this definition were discussed already, but to summarize:

* This definition will work _at least as well_ as the commit date heuristic,
  with the added bonus of being absolutely sure our results are right. We can
  update algorithms like paint_down_to_common() to use this reachability index
  without performance problems in some cases.

* If someone creates a terrible commit with a date that is far in the future,
  this definition is no worse than existing generation numbers (because we
  enforce that the corrected commit date is strictly larger than the parents'
  corrected commit date).

To implement this index, we can re-use the 30 bits per commit in the
commit-graph file that are used for generation numbers, but use them instead
for the difference between the corrected commit date and the actual commit
date. File format v2 gives us a version value that can be incremented to signal
the change in meaning.

Some work is required to adjust the existing generation-number-aware algorithms
to care about an "arbitrary" reachability index. It could be as easy as a
helper function that returns a function pointer to the proper compare function.

If someone wants to move forward on this topic while I'm gone, please
volunteer. Otherwise, this will be among my first items to work on when I
return from leave.

[2] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/
    [RFC] Generation Number v2

Incremental Writes
------------------

Similar to the split index, an incremental commit-graph file can be implemented
to reduce the write time when adding commits to an existing (large)
commit-graph. In this case, the .git/objects/info/commit-graph file would
be small, and have a pointer to a base file, say "cgraph-<hash>.cgraph", that
contains the majority of the commits.

The important thing to keep in mind here is that we use integers to refer to
a commit's parents. This integer would need to refer to the order of commits
when you concatenate the orderd lists from each file. When doing this, we
can point into the base file as well as the tip file. Since the base
commit-graph file would be closed under reachability, it only needs to care
about commits in its file.

It is also possible to have multiple base files, and we can use the unused
byte in the commit-graph file format v2 to store the number of base files.
We can then store a list of file names in a new chunk, presenting the ordered
list of base files. We still want to keep this list short, but there may be
benefits to a variable number. I expect the first version would limit the
construction to one base file for simplicity's sake.

When this is implemented, we can use it to write the commit-graph at fetch
time. A config setting, say 'fetch.writeCommitGraph', could enable this write.
Since most writes would add a small number of commits compared to the large
base file, this would be a more reasonable cost to add to a fetch. Since we
verify the pack upon download, the commits it contained will already be in
the memory cache and we won't need to re-parse those commits.

Volunteers welcome.

Bloom Filters
-------------

Using bloom filters to speed up file history has been discussed and prototyped
on-list (see [12] and the thread before it). Thanks for lots of contributions
in this area! A lot of people have shown an interest in this feature, and it
is particularly helpful with server-side queries.

Any implementation here should check that it is helping 'git blame' as much as
it can [13]. It's entirely possible that the performance problem mentioned
there is more about the size of the file and not finding the commits that
changed the file, but it's worth digging in here.

A few people have mentioned that they are interested in pursuing this
implementation, so it would be good to declare intentions during the summit.

[12] https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@xxxxxxxxx/

[13] https://public-inbox.org/git/CABXAcUzoNJ6s3=2xZfWYQUZ_AUefwP=5UVUgMnafKHHtufzbSA@xxxxxxxxxxxxxx/

Enabled by Default?
-------------------

I proposed turning on the feature by default [3], but that had some
resistance [4] and I never followed up to that remark. (It involved the
hope that we could consolidate commit walks during a gc/repack. I'm unsure
this is a goal worth pursuing.) Since there has been more interest recently [5]
I think it would be good to discuss what concerns we may have in turning this
on by default. Specifically, make 'core.commitGraph' and 'gc.writeCommitGraph'
default to 'true'. Users could still opt-out.

[3] https://public-inbox.org/git/pull.50.git.gitgitgadget@xxxxxxxxx/
[4] https://public-inbox.org/git/xmqqlg6vvrur.fsf@xxxxxxxxxxxxxxxxxxxxxxxxx/
[5] https://public-inbox.org/git/87bm464elm.fsf@xxxxxxxxxxxxxxxxxxx/

Multi-Pack-Index Status Report
==============================

The multi-pack-index feature shipped with Git 2.20! We've been using this
feature (or, a similar implementation as it changed a lot with review) in
VFS for Git for a year now. It's been critical to solving the many-packs
problem we have with our prefetch packs model. Our next version ships with
Git 2.20 and the upstream implementation.

We are now able to start tackling our space problem with these many packs.
Our solution includes the 'expire' and 'repack' subcommands [6]. We will run
these in the background [7] to slowly reduce the space we are using. Since
Git references the multi-pack-index, we are able to delete packs that have
no referenced objects from the multi-pack-index without interrupting user
commands (I don't think the same holds for 'git repack'). This "highly
available" model makes me think that this could be useful to other scenarios.

We are looking for interest from other users or groups in this feature. We
want this feature to be adopted, and that means the future of the feature 
should depend on more scenarios than our specific case.

Here are some ideas to make this more useful for others:

1. Incremental writes. See the commit-graph section for details. This would
   allow writing the multi-pack-index on fetch, helping users who have set
   gc.auto=0 keep performance high even though they have packs piling up.

2. Stable object order and bitmaps. This is discussed in the design
   document [8]. This is more useful for server environments.

[6] https://public-inbox.org/git/pull.92.git.gitgitgadget@xxxxxxxxx/
    [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index

[7] https://github.com/Microsoft/VFSForGit/blob/9cad154293456a41bef593a75e1ad2cb840c8524/GVFS/GVFS.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L158
    The use of 'expire' and 'repack' in VFS for Git

[8] https://github.com/git/git/blob/master/Documentation/technical/multi-pack-index.txt#L77-L84
    multi-pack-index and stable object order

Test Coverage Report
====================

My intentions creating the test coverage report were to avoid bugs by double-
checking that we are testing all logic that was both (1) non-trivial, and
(2) new. The report does tend to be noisy with a lot of trivial blocks (error
cases) or code that was not covered before but was updated with a mechanical
refactoring. I'm hoping to attack these issues by using a new approach when
generating the reports.

I've created a GitHub repo [9] that contains new logic for generating the test
coverage report. In particular, it will now generate a text report that will
be sent to the list, but also an HTML report that will be posted online (see
[10] for an example).

In addition, the repo has an 'ignored' directory. This directory will be filled
with files that mirror their corresponding files in the Git repo, but contain
line numbers and contents for lines that have been deemed "unimportant". For
instance, I didn't want to just ignore all lines that say simply "return;" but
we can check that line 302 of builtin/checkout.c says "return;" and ignore that
line in the report [11].

I'll try to review the test report and add ignored lines before generating the
next report. I'll also accept PRs that add ignored lines (with justification).

I think this will help the usefulness significantly, especially as topics merge
down into 'next' and 'master'. If we track the ignored lines throughout a cycle,
then the report for 'maint' versus 'master' near release time may actually be
reasonable to read.

Any other feedback on the reports is greatly appreciated!

[9] https://github.com/derrickstolee/git-test-coverage

[10] https://derrickstolee.github.io/git-test-coverage/reports/2019-01-29.htm

[11] https://github.com/derrickstolee/git-test-coverage/blob/master/ignored/builtin/checkout.c