Re: [RFC] Generation Number v2

Derrick Stolee <stolee@xxxxxxxxx> writes:
> On 11/1/2018 8:27 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>
>>> Please also let me know about any additional tests that I could
>>> run. Now that I've got a lot of test scripts built up, I can re-run
>>> the test suite pretty quickly.
>>
>> I would add straightforward 1-to-N and N-to-1 reachability tests in the
>> form of `git branch / tag --contains` and `git branch / tag --merged`,
>> and perhaps also ahead-behind calculations used by `git status`, and
>> N-to-M reachability tests used by tag following code in push and fetch.

1-to-N and N-to-1 can be done with `git branch --merged` / `--contains`.

>> Possibly also A...B walk, if it is not implemented via calculating
>> merge-base.
>
> I believe this uses paint_down_to_common(), but looks at the PARENT1 and
> PARENT2 flags afterward to determine the left/right/both relationships.

Right, so I guess this is the same internal mechanism that `git
merge-base` utilizes, just used a bit differently.  Thus benchmarking
`git merge-base` should also cover A...B performance, I guess.
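For reference, the painting mechanism can be sketched like this (a toy
illustration with made-up commit names and a made-up `paint` helper;
simplified compared to Git's actual paint_down_to_common(), which also
stops early at the common boundary):

```python
# Toy sketch of the PARENT1/PARENT2 painting used for merge-base and
# for the A...B symmetric difference: walk down from both tips,
# propagating flags; commits carrying both flags are common, commits
# with exactly one flag form the left/right sides of A...B.
PARENT1, PARENT2 = 1, 2

parents = {"A": ["C"], "B": ["C"], "C": ["R"], "R": []}

def paint(a, b):
    flags = {a: PARENT1, b: PARENT2}
    stack = [a, b]
    while stack:
        c = stack.pop()
        for p in parents[c]:
            new = flags.get(p, 0) | flags[c]
            if new != flags.get(p, 0):
                flags[p] = new      # flag changed: re-walk below p
                stack.append(p)
    return flags

flags = paint("A", "B")
left  = [c for c, f in flags.items() if f == PARENT1]  # only from A
right = [c for c, f in flags.items() if f == PARENT2]  # only from B
both  = [c for c, f in flags.items() if f == PARENT1 | PARENT2]
print(left, right, both)  # ['A'] ['B'] ['C', 'R']
```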

>> Maybe even --ancestry-path walk, though I am not sure how important best
>> performance for this case is (we would want good performance, but
>> perhaps best is not needed for a rarely used command).
>
> Currently, the implementation of --ancestry-path does not use a
> reachability index.

Well, using a reachability index would certainly give it a boost.

[...]
>>> Generation Number Performance Test
>>> ==================================
>>>
>>> Git uses the commit history to answer many basic questions, such as
>>> computing a merge-base. Some of these algorithms can benefit from a
>>> _reachability index_, which is a condition that allows us to answer
>>> "commit A cannot reach commit B" in some cases. These require pre-
>>> computing some values that we can load and use at run-time. Git
>>> already has a notion of _generation number_, stores it in the commit-
>>> graph file, and uses it in several reachability algorithms.
>>
>> Note that there are other kinds of reachability indices.
>>
>> First, there are reachability indices that can answer the full
>> reachability query (if commit A can reach commit B, or if commit A
>> cannot reach commit B) directly, without walking the commit graph at
>> all: so called label-only approach.  For example one could store for
>> each commit the compressed list of all commits reachable from it
>> (transitive closure compression).
>>
>> Those, I think (but I have not checked), would not be of much use for
>> Git, as the size of the index grows faster than linearly with the number
>> of commits, as does the time to compute such an index.  So probably of no
>> use to Git, at least not directly (Git uses so called "bitmap index",
>> see e.g. [1], which stores reachability bit-vector as compressed
>> bitmap... but not for all commits, only for a small subset).
>>
>>
>> Second, beside negative-cut reachability indexes, that can answer with
>> certainity that "commit A cannot reach commit B", like the generation
>> numbers (also known as level, or topological level), there also
>> positive-cut indexes (usually if not always based on the spanning tree
>> or trees for the DAG), that can answer when "commit A can reach commit
>> B".
>>
>> I think that those can lead to significant speedups for at least some
>> types of commit traversal and reachability queries that Git needs to
>> answer.  But currently no algorithm has a provision for using a
>> positive-cut filter index.  Anyway, such an index would be an auxiliary
>> thing; see the next point.
>>
>>
>> Third, we require more than having reachability index in the sense of
>> having some kind of _labelling_, often composite, that can answer either
>> "commit A cannot reach commit B" or "commit A can reach commit B",
>> depending on the type.  Git for most operations needs more, namely an
>> actual _ordering_, that is a weak order (or to be more exact a total
>> preorder, i.e. there can be two different commits with the same
>> "generation number" or index, but always either idx(A) ≲ idx(B) or
>> idx(B) ≲ idx(A)) and not only partial ordering (where there can be
>> incomparable elements, i.e. neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
>> This is because it needs to be used in priority queue to decide which
>> commit to travel next; more on that later.  This is also why there can
>> be only one such "main" reachability _index_.
>>
>> [1]: https://githubengineering.com/counting-objects/
>
> Thanks for the details. At the moment, I'm only interested in improving our
> negative-cut reachability index, as we have algorithms that can take
> advantage of them (and can compare their performance directly).

Right, I can agree with that.  Positive-cut reachability index use would
need to be added separately.

What I wanted to emphasize is the need for a _number_ (or a total
preorder), not simply an _index_ (a label, perhaps a composite one).
This means, as I wrote, that there would be only one main "generation
number".  It also limits the number of possible reachability indexes to
consider.
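As a toy illustration of why the walk needs a total preorder (made-up
`parents` and `gen` dicts standing in for values that would be read from
the commit-graph file):

```python
import heapq

# Commits are popped highest-"generation" first, which guarantees that
# no commit is popped before every commit that can reach it.  A mere
# partial order would leave the priority queue with incomparable
# elements and no well-defined "next commit to visit".
parents = {"M": ["A", "B"], "A": ["R"], "B": ["R"], "R": []}
gen = {"R": 1, "A": 2, "B": 2, "M": 3}

def walk_in_generation_order(tip):
    """Yield commits reachable from tip, in non-increasing gen order."""
    seen = {tip}
    queue = [(-gen[tip], tip)]       # max-heap via negated key
    while queue:
        _, c = heapq.heappop(queue)
        yield c
        for p in parents[c]:
            if p not in seen:
                seen.add(p)
                heapq.heappush(queue, (-gen[p], p))

print(list(walk_in_generation_order("M")))  # ['M', 'A', 'B', 'R']
```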

[...]
>> In Git there are at least four different types of graph traversal
>> queries, with somewhat different requirements, and affected differently
>> by various reachability indexes.
>>
>>
>> ## Pure reachability queries
>>
>> First there are pure reachability queries, when we are interested only
>> in answering the question whether commit A can reach commit B or not.  We
>> are not interested in the list of commits between A and B; if a reachability
>> index, be it negative-cut or positive-cut, can answer the question, it
>> is all we need.  If we need to walk (perform online search), then we are
>> interested in just being able to arrive; we need to walk only one path.
>>
>> This query can be performed one-to-one, one-to-many (which of commits
>> from the list can be reached from commit A), many-to-one (which commits
>> from the list can reach commit B), and many-to-many -- the last for
>> example with recently added get_reachable_subset().
>>
>> These types of queries are done by "git branch --contains" (and
>> "--no-contains") and "git branch --merged" (and "--no-merged"), the same
>> for "git tag"; if I remember it correctly those were among first
>> operations that were sped up by commit-graph mechanism.  Many-to-many
>> queries are done for tag following in "git push" and "git fetch".
>>
>> Note that those queries are missing from gen-test.
>
> These are definitely a good idea to pursue further testing. The only
> thing I can think of right now is that the tests can be hard to set up,
> but perhaps `git branch --remote --contains` or `git tag --contains`
> would be interesting.

Also `git tag --merged`, for the "reverse" check; then this should cover
both sides of N-to-M queries used by `git fetch` (add_missing_tags() via
new get_reachable_subset()).

In any case, if all else fails, we can add a new test-something script...
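For the record, here is roughly what such a pruned one-to-one
reachability query looks like (a toy sketch with made-up data, loosely
analogous to what Git's commit-reach code does, not the actual
implementation):

```python
# Negative-cut pruning: if gen(X) < gen(B), then X cannot reach B, so
# the walk from A never descends below gen(B).
parents = {"D": ["C"], "C": ["A", "B"], "B": ["A"], "A": []}
gen = {"A": 1, "B": 2, "C": 3, "D": 4}

def can_reach(a, b):
    """Return True iff commit a can reach commit b."""
    if gen[a] < gen[b]:              # negative cut: impossible
        return False
    stack, seen = [a], {a}
    while stack:
        c = stack.pop()
        if c == b:
            return True
        for p in parents[c]:
            # prune: p and everything below it cannot reach b
            if p not in seen and gen[p] >= gen[b]:
                seen.add(p)
                stack.append(p)
    return False

print(can_reach("D", "A"), can_reach("B", "D"))  # True False
```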

>> ## Topological ordering
>>
>> Here reachability index is used to find out in-degree of considered
>> commits (starting from a set of starting points).  Here we don't really
>> need to walk all paths, but we need to walk all possible ways of
>> arriving at given commit.
>
> You make a good point here about the difference between "all paths to C"
> and "all children of C", since the number of paths can grow exponentially
> (and frequently does in Git) but the number of children is linear (and
> small in practice).

Sidenote: I wonder what the distribution of the number of parents
(out-degree), the number of children (in-degree), and the number of
paths to a commit looks like.

>> Though I guess the [new] incremental algorithm could be modified to only
>> check for in-degree being zero or non-zero.  Would it make it slower or
>> faster?  I guess in the absence of a good positive-cut reachability
>> filter it would make it slower.
>
> I don't understand how a positive-cut filter helps in this case. The point
> of the in-degree calculation is to say "we already output all commits that
> can reach this commit" before we output another commit. How could a
> positive-cut filter help here?

That was more of an idle thought than a concrete proposal.

With the new 3-phase algorithm, we populate topo_queue with commits of
in-degree zero, calculating this in-degree if necessary.  I thought
that those pre-candidates could have been eliminated early with a
positive-cut index, instead of having to compute the in-degree in full.
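For context, the in-degree bookkeeping can be sketched as follows (a
simplified toy version of Kahn's algorithm on a made-up graph, not
Git's 3-phase implementation):

```python
from collections import defaultdict, deque

# Count, for each commit, how many of its children lie within the
# walked region; a commit is output only once its in-degree drops to
# zero, i.e. once all commits that can reach it have been output.
parents = {"M": ["A", "B"], "A": ["R"], "B": ["R"], "R": []}

def topo_order(tips):
    indegree = defaultdict(int)
    # Phase 1: walk from the tips, counting child edges.
    stack, seen = list(tips), set(tips)
    while stack:
        c = stack.pop()
        for p in parents[c]:
            indegree[p] += 1
            if p not in seen:
                seen.add(p)
                stack.append(p)
    # Phase 2: emit commits whose in-degree has dropped to zero.
    queue = deque(t for t in tips if indegree[t] == 0)
    out = []
    while queue:
        c = queue.popleft()
        out.append(c)
        for p in parents[c]:
            indegree[p] -= 1
            if indegree[p] == 0:
                queue.append(p)
    return out

print(topo_order(["M"]))  # a parent never precedes its children
```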

[...]
>> ## Strict ancestry, or ancestry path
>>
>> When given a range of commits to display (e.g. A..B), only display
>> commits that exist directly on the ancestry chain between A and B,
>> i.e. commits that are both descendants of A, and ancestors of B.  In
>> other words, find all paths from A to B (if they exist).
>>
>> In this case we want to walk all the paths, heavily pruning.  This
>> should be less dependent on reachability index quality as an index (as
>> an ordering), and more on pruning ability of the filter.
>>
>> This case is not tested in gen-test, but I wonder how often this feature
>> is actually used (if it would be worth adding it to the benchmark, and if
>> so, with what weight behind it).
>
> As mentioned earlier, the --ancestry-path algorithm does not currently use a
> negative-cut filter, so we would not gather any data on this at the moment.
>
> However, I imagine the algorithm is similar to the typical A..B case, as we
> need to discover the (po)set difference, construct the reversed digraph on
> those commits, and walk backwards from A. Assuming the difference is much
> smaller than the entire commit graph, then the main cost is the walk that
> discovers the difference, and hence is covered somewhat by the `git log
> --topo-order A..B` tests.

I would say that --ancestry-path is a little like an A..B walk, and a
little like the in-degree computation in --topo-order (because in the
in-degree computation we are also interested only in paths that arrive
at a given commit).

[...]
>>> **V1: (Epoch, Date) Pairs.**
>>>
>>> For each commit, store a pair of values called the _epoch_ and the _date_.
>>> The date is the normal commit date. The _epoch_ of a commit is the minimum
>>> X such that X is at least the maximum epoch of each parent, and at least
>>> one more than the epoch of any parent whose date is larger than the date
>>> of this commit (i.e. there is clock skew between this commit and this
>>> parent). In this way, we can use the following reachability index:
>>>
>>>     If epoch(A) < epoch(B), then A cannot reach B.
>>>     If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.
>>
>> I wonder what makes it perform worse than corrected date aka V3.
>
> In the example `git merge-base v4.8 v4.9` in the Linux repo, the topology
> includes two commits (say, C1 and C2) of low generation number but high
> commit-date. These commits also have low epoch. However, there are 607
> commits in the repo whose commit date is smaller than a parent's commit
> date. This means that the epoch in the main trunk of the repo can be as
> high as 607, while the epoch for C1 and C2 is likely in the single digits.
> This means that we need to walk all commits with epoch greater than
> min { epoch(C1), epoch(C2) } before we explore C1 and C2 and terminate
> the walk.
>
> In V3, the commits C1 and C2 have high corrected commit date, higher than
> any of the commits that require positive offset to overcome the clock skew
> with their parents. This allows the walk to be very similar to the old
> algorithm, as seen in this `git merge-base A B` test summary:
>
> Linux c8d2bc9bc39e 69973b830859
[I have added names of algorithms / generation numbers here]

>   OLD: 167468        -----         no reachability index
>    V0: 635579        gen(C)        (Minimum) Generation Number
>    V1: 630138        epoch(C)      (Epoch, Date) Pairs
>    V2:  33716        maxgen(C)     Maximum Generation Numbers
>    V3: 167496        cdate(C)      Corrected Commit Date
>    V4: 153774        felineX(C)    FELINE Index

Ah, so the problem with (Epoch, Date) pairs is that the trunk acquires
a high epoch, while side branches (which may be shortcuts) often have a
lower epoch.  Makes sense, even if it is perhaps not obvious without
such an examination.
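To make the quoted V1 definition concrete, here is a toy computation
(made-up commits; "S" plays the clock-skewed commit):

```python
# epoch(C) is the minimum X that is at least every parent's epoch, and
# at least one more than the epoch of any parent whose commit date is
# larger than C's date (clock skew).  Commits listed in topological
# order, parents before children.
commits = [  # (name, commit date, parent names)
    ("R", 100, []),
    ("A", 200, ["R"]),
    ("S", 500, ["R"]),       # skewed: dated after its child B
    ("B", 300, ["A", "S"]),
]

epoch, date = {}, {}
for name, d, ps in commits:
    date[name] = d
    e = 0
    for p in ps:
        e = max(e, epoch[p])
        if date[p] > d:      # skewed parent forces an increment
            e = max(e, epoch[p] + 1)
    epoch[name] = e

# Negative cut: if epoch(A) < epoch(B), or epochs are equal and
# date(A) < date(B), then A cannot reach B.
print(epoch)  # {'R': 0, 'A': 0, 'S': 0, 'B': 1}
```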

>>> **V2: Maximum Generation Numbers.**
>>> The _maximum generation number_ of a commit (denoted by maxgen(A)) is
>>> defined using its _children_ and the total number of commits in the repo.
>>> If A is a commit with no children (that is, there is no commit B with
>>> A as a parent) then maxgen(A) is equal to the number of commits in the repo.
>>> For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
>>> among children B. This can be computed by initializing maxgen(C) to the
>>> number of commits, then walking the commit graph in topological order,
>>> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
>>> of the currently-walking commit C. We then have the same reachability
>>> index as minimum generation number:
>>>
>>>    If maxgen(A) < maxgen(B), then A cannot reach B.
>>
>> If I understand it correctly, this is the same as reverse generation
>> numbers, or reverse topological levels; in other words, generation
>> numbers on the reversed graph -- only transformed:
>>
>>     maxgen(A) == number_of_commits - (revgen(A) - 1)

Actually, using number_of_commits may not be the best choice here with
respect to the number of commits that need to have maxgen(C)
recalculated on update...

>> We define revgen(A) in the following way:
>> - head tips (source nodes), i.e. commits with in-degree of 0 (with no
>>    children), have revgen(A) = 1 (revgen(A) = 0 is left for commits
>>    outside the commit-graph, which translates to INFINITY for
>>    maxgen(A)).
>> - otherwise, it is 1 more than the maximum revgen(C) of its children
>>
>> They are equivalent, but maxgen(A) is "backward compatible", that is,
>> the reachability condition is exactly the same as for ordinary generation
>> numbers:
>>
>>      If gen(A) < gen(B), then A cannot reach B.
>>      If maxgen(A) < maxgen(B), then A cannot reach B.
>>
>> But
>>
>>      If revgen(A) > revgen(B), then A cannot reach B.
>
> Yes, these are the same idea, and the reason to phrase it as I did is to
> keep the inequality in the same direction.

...and if using gen(C) as a starting point, you avoid rewriting the
whole DAG on update.

Definition of maxgen'(C):

  - if A is a commit without children (that is, there is no commit B with
    A as a parent), then maxgen'(A) = gen(A)
  - otherwise, maxgen'(A) = -1 + min_{B has A as parent}(maxgen'(B))

Update algorithm:

  1. compute gen(A) for all new commits, which is O(changes)
  2. select the commit A with the highest gen(A) that does not yet have
     maxgen'(A) computed [such a commit has no children, because of
     gen() properties] and add it to queue Q
     2.1. take commit C from the queue Q
     2.2. for each parent P of commit C
          2.2.1. if maxgen'(P) <= maxgen'(C) - 1, continue
          2.2.2. update maxgen'(P) to maxgen'(C) - 1
          2.2.3. add P to queue Q

I have tested this version of the maxgen(C) algorithm in a few cases in
another post in this thread [3], and it looks like the number of commits
that need to have maxgen() recalculated is limited.

[3]: https://public-inbox.org/git/86ftwjzv1h.fsf@xxxxxxxxx/
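The update algorithm above, sketched in code (toy data; assumes gen()
and maxgen'() are already available for the pre-existing commits):

```python
# After a new commit arrives: recompute gen() for it (cheap and local),
# seed it with maxgen'(new) = gen(new) since it is childless, then push
# decreased values down to parents only while they actually decrease.
parents = {"R": [], "A": ["R"], "B": ["A"]}
gen = {"R": 1, "A": 2, "B": 3}
maxgen = {"R": 1, "A": 2, "B": 3}    # maxgen' values before the update

def add_commit(name, ps):
    """Append a commit and update maxgen' with a bounded walk."""
    parents[name] = ps
    gen[name] = 1 + max((gen[p] for p in ps), default=0)
    maxgen[name] = gen[name]         # new commit is childless
    queue = [name]
    while queue:
        c = queue.pop()
        for p in parents[c]:
            if maxgen[p] <= maxgen[c] - 1:
                continue             # already small enough: stop here
            maxgen[p] = maxgen[c] - 1
            queue.append(p)
    return maxgen

add_commit("C", ["B"])   # gen(C) = 4; nothing below B needs rewriting
print(maxgen)            # {'R': 1, 'A': 2, 'B': 3, 'C': 4}
```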

[...]
>>> _Commentary:_ The known examples where (minimum) generation numbers perform
>>> worse than commit date heuristics involve commits with recent commit dates
>>> but whose parents have very low generation number compared to most recent
>>> commits. In a way, minimum generation numbers increase the odds that we
>>> can say "A cannot reach B" when A is fixed and B is selected at random.
>>> Maximum generation numbers increase the odds that we can say "A cannot
>>> reach B" when B is fixed and A is selected at random. This helps us when
>>> we are exploring during a reachability algorithm and have explored few
>>> commits and want to say that the large set of unexplored commits cannot
>>> reach any of the explored commits.
>>>
>> I guess that the situation where a shortcut path has a recent date,
>> while having fewer commits in the path (and thus a lower ordinary
>> generation number), is more common in real commit graphs because one
>> might want to base a commit (for example a bugfix) on an old commit,
>> but commits usually get merged quickly, and are not left for a long
>> time.  If they are left for a long time, they usually need correction
>> (via rebase) before being merged, and again we have a commit with a
>> date close to the date of the merge, leading to a shortcut.
>>
>> Does this reasoning look correct to you?
>
> These "bug fixes based on old commits" cases are few and far between, so
> are not enough to explain how often maximum generation number works well.
> Instead, I would say that it works well because the typical Git pattern is
> to merge into a single trunk. In the case of projects like Linux, there are
> multiple "trunks" that are run by lieutenants and are merged into a "super
> trunk"; even in these cases, the number of trunks is much smaller than the
> number of topics. Let's focus on a single-trunk model for the
> discussion below.
>
> When working with a single trunk, most merge commits are in the first-parent
> history of that trunk. As we travel down those trunks, the maximum generation
> is decreasing. This means that the number of commits that can have equal
> generation is limited by the number of commits in a single topic, which is
> typically small. This is opposed to the minimum generation, which can be
> unbounded. For example, if everyone bases their Git patches on the latest
> release, then when we walk into those topics, we need to walk all the way
> to that release to determine reachability.
>
> Here is a picture, representing each topic as an interval of consecutive
> generation values:
>
> # Minimum Generation Numbers:
>
> v2.19.0----------M1----M2----M3----M4
>     |\          /     /     /     /
>     | [topic1]-/     /     /     /
>     |\              /     /     /
>     | [topic2]-----/     /     /
>     |\                  /     /
>     | [topic3]---------/     /
>      \                      /
>       [topic4]-------------/
>
>     =====generation====>
>
> # Maximum Generation Numbers:
>
> v2.19.0----------M1----M2----M3----M4
>     |\          /      |     |     |
>     | [topic1]-/       |     |     |
>     |\                 |     |     |
>     | \-------[topic2]/      |     |
>     |\                       |     |
>     | \-------------[topic3]/      |
>      \                             |
>       \-------------------[topic4]/
>
>     =====generation====>

All right, so the problem with (minimum) generation number performance
will be exhibited in any project that utilizes the topic-branch
workflow.

Nice to know the cause.

>>> **V3: Corrected Commit Date.**
>>> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
>>> be the maximum of the commit date of C and the [corrected] commit dates of its
>>> parents.
>>
>> Wouldn't it be better to use the maximum of the commit date of C, and
>> one second more than the corrected commit dates of its parents?  This
>> way corrected dates would be strictly monotonic along the chain.
>>
>> Sidenote: in theory, sufficiently bad clock skew (like e.g. a wrong year
>> in the future) could skew large swaths of the commit graph.  In
>> practice, in real-life repositories, this doesn't happen, I guess.
>
> Yes, you're right. Congyi Wu from the Azure Repos server team pointed
> this out to me privately, as well. His example was someone merging a
> commit with date 3000-01-01, and making all later commits be
> useless. Adding one to the commit dates of the parents "fixes" this
> issue by reverting our negative-cut index to be equivalent to minimum
> generation number for all commits with this commit in their history.

I wonder if the problem of bad clock skew with a date far in the future
could be solved by instead using, for such cases (where date(parent) >
date(commit) + some large Delta), the average of the dates of the
commit and of the grandparents (the parents of the parent commit with
such large positive clock skew).  Though this wouldn't be, strictly
speaking, immutable...

Another edge case, which I hope is not a problem in real-life
repositories, is clock skew very far in the past, combined with storing
offsets in 30 bits.  If there were some malformed commit with 0 as its
commit date epoch (Unix time), i.e. 1970-01-01, then the delta / offset
might not fit in the 30 bits that we use for storing generation
numbers; there are 32+2 = 34 bits for the commit date.  Something to
think about.
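To illustrate both edge cases, here is a toy computation of corrected
commit dates with the strictly-monotonic "+1" variant discussed above
(made-up data; not Git's implementation):

```python
# cdate(C) is the maximum of C's own commit date and one more than each
# parent's corrected date.  A parent dated far in the future (the
# 3000-01-01 example) then degrades descendants to minimum-generation
# behaviour instead of poisoning comparisons -- but the stored offset
# (cdate - date) can overflow a narrow field.
commits = [  # (name, commit date as Unix time, parents), topo order
    ("R", 1_000, []),
    ("S", 32_503_680_000, ["R"]),   # skewed: dated around year 3000
    ("A", 2_000, ["S"]),
    ("B", 3_000, ["A"]),
]

cdate, offset = {}, {}
for name, d, ps in commits:
    c = d
    for p in ps:
        c = max(c, cdate[p] + 1)     # strictly greater than parents
    cdate[name] = c
    offset[name] = c - d             # what a commit-graph might store

# Negative cut: if cdate(A) < cdate(B), then A cannot reach B.
too_big = [n for n, o in offset.items() if o >= 2**30]
print(too_big)  # descendants of the skewed commit overflow 30 bits
```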

[...]
>>> ### Comparing Reachability Index Versions Viability
>>>
>>> Before considering how well these indexes perform during our algorithm
>>> runs, let's take a moment to consider the implementation details and
>>> how they relate to the existing commit-graph file format and existing
>>> Git clients.
>>>
>>> * **Compatible?** In our test implementation, we use a previously unused
>>>    byte of data in the commit-graph format to indicate which reachability
>>>    index version we are using. Existing clients ignore this value, so we
>>>    will want to consider if these new indexes are _backwards compatible_.
>>>    That is, will they still report correct values if they ignore this byte
>>>    and use the generation number column from the commit-graph file assuming
>>>    the values are minimum generation numbers?
>>
>> In other words "backward compatibility" for a reachability index means
>> that the reachability condition is exactly the same as it was:
>>
>>      If index(A) < index(B), then A cannot reach B.
>>
>> and also values of the reachability index are written in place of
>> generation numbers in the commit-graph.

Note that it is backwards compatibility only for the use of generation
numbers, but of course not for updating them.

>>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>>    actually create a new object with a new object ID. Are the values we
>>>    store for these reachability indexes also immutable?
>>>
>>> * **Local?** Are these values **locally computable**? That is, do we only
>>>    need to look at the parents of a commit (assuming those parents have
>>>    computed values) in order to determine the value at that commit?
>>>
>> Those two features imply that the number of commits that we need to walk
>> to be able to update reachability index after appending new commits is
>> O(new commits), which means O(changes).  I would allow for simple
>> transformations of other values, e.g. adding a constant; we need to
>> rewrite commit-graph file anyway.
>>
>> I wonder if there are reachability indexes that are either immutable but
>> not local, or not immutable but local.  Doesn't being local imply
>> immutability, with some common-sense assumptions (like no random choice
>> for parent-less commits)?
>
> Local probably implies immutable, but the reverse does not. The
> example I gave elsewhere in the thread was "number of commits
> reachable from C". You can't determine this directly from the values
> of your parents, but it doesn't change even if you add more commits to
> the graph.

Right, reach(C) is immutable but not local.

Local may not imply immutable if the starting points, i.e. parentless
commits, get assigned non-immutable values.  But I don't see how that
could happen without some randomness.
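As a toy illustration that reach(C) is immutable but not local
(made-up graph):

```python
# reach(C) = number of commits reachable from C, including C itself.
# It cannot be combined from parents' values alone, because their
# reachable sets overlap -- but it never changes when new commits are
# appended to the graph.
parents = {"M": ["A", "B"], "A": ["R"], "B": ["R"], "R": []}

def reach(c, seen=None):
    """Count commits reachable from c by an explicit walk."""
    if seen is None:
        seen = set()
    if c in seen:
        return 0
    seen.add(c)
    return 1 + sum(reach(p, seen) for p in parents[c])

# Not local: 4 != 1 + reach(A) + reach(B) == 5, since A and B share R.
print(reach("M"), 1 + reach("A") + reach("B"))  # 4 5
```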

>>> Data
>>> ----
>>>
>>> We focused on three types of performance tests that test the indexes
>>> in different ways. Each test lists the `git` command that is used,
>>> and the table lists which repository is used and which inputs.
>>>
>>> ### Test 1: `git log --topo-order -N`
[...]
>>>
>>> | Repo         | N     | V0     | V1      | V2     | V3     | V4     |
>>> |--------------|-------|--------|---------|--------|--------|--------|
>>> | android-base | 100   |  5,487 |   8,534 |  6,937 |  6,419 |  6,453 |
>>> | android-base | 1000  | 36,029 |  44,030 | 41,493 | 41,206 | 45,431 |
>>> | chromium     | 100   |    101 | 424,406 |    101 |    101 |    101 |
>>> | gerrit       | 100   |  8,212 |   8,533 |    164 |    159 |    162 |
>>> | gerrit       | 1000  |  8,512 |   8,533 |  1,990 |  1,973 |  3,766 |
>>> | Linux        | 100   | 12,458 |  12,444 | 13,683 | 13,123 | 13,124 |
>>> | Linux        | 1000  | 24,436 |  26,247 | 27,878 | 26,430 | 27,875 |
>>> | Linux        | 10000 | 30,364 |  28,891 | 27,878 | 26,430 | 27,875 |
>>> | electron     | 1000  | 19,927 |  18,733 |  1,072 | 18,214 | 18,214 |
>>> | Ffmpeg       | 10000 | 32,154 |  47,429 | 10,435 | 11,054 | 11,054 |
>>> | jgit         | 1000  |  1,550 |   6,264 |  1,067 |  1,060 |  1,233 |
>>> | julia        | 10000 | 43,043 |  43,043 | 10,201 | 23,505 | 23,828 |
>>> | odoo         | 1000  | 17,175 |   9,714 |  4,043 |  4,046 |  4,111 |
>>> | php-src      | 1000  | 19,014 |  27,530 |  1,311 |  1,305 |  1,320 |
>>> | rails        | 100   |  1,420 |   2,041 |  1,757 |  1,428 |  1,441 |
>>> | rails        | 1000  |  7,952 |  10,145 | 10,053 |  8,373 |  8,373 |
>>> | swift        | 1000  |  1,914 |   4,004 |  2,071 |  1,939 |  1,940 |
>>> | tensorflow   | 1000  | 10,019 |  39,221 |  6,711 | 10,051 | 10,051 |
>>> | TypeScript   | 1000  |  2,873 |  12,014 |  3,049 |  2,876 |  2,876 |
>>
>> Do I understand it correctly that the range of N for a given project is
>> limited by the depth of the project history (hence maximum N of 10000
>> for Linux kernel, but only 100 for chromium)?
>
> I ran the tests with N equal to 100, 1000, and 10000 on all repos, but only
> included results for values that were interesting. For chromium, something
> about the topology let all versions (except V1) report N + 1, so I
> didn't include the other values.

Ah, that information (that only interesting cases were included) would
be good to know.

I wonder if the relative performance of the different generation numbers
is correlated with the topology of the commit graph, for example with
the ratio of increment patterns to integration patterns in the commit
metagraph (like --simplify-by-decoration), as in [4] and related works.

[4] Marco Biazzini, Martin Monperrus, Benoit Baudry "On Analyzing the
    Topology of Commit Histories in Decentralized Version Control
    Systems", DOI: 10.1109/ICSME.2014.48 (2014)

>> I wonder what the OLD numbers are for these operations.
>
> The OLD numbers are equal to the number of commits reachable from HEAD
> in every case. I didn't think this was interesting to report. You can
> see the numbers for yourself in the output data file:
>
> https://github.com/derrickstolee/gen-test/blob/master/topo-order-summary.txt

Thanks for the info.

-- 
Jakub Narębski



