Re: [RFC] Generation Number v2

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 31 Oct 2018 08:30:29 -0400

On 10/29/2018 11:59 PM, Junio C Hamano wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:

**V3: Corrected Commit Date.**
For a commit C, let its _corrected commit date_ (denoted by cdate(C))
be the maximum of the commit date of C and the commit dates of its
parents.
"maximum of the commit date of C and the corrected commit dates of
its parents"?

That's what I mean. Thanks.

We've talked about exactly this one in the past (long before any of
Microsoft folks other than Dscho came to the scene) but without an
implementation, or detailed design and analysis.  I am very happy to
see the idea among other possibilities to be considered again.  This
time around, we may finally come up with something better than the
"commit dates with SLOP" thing ;-).

Essentially, the felineY order is selected with the goal of swapping
positions of topologically-independent commits relative to the felinX
ordering. The resulting reachability index is as follows:

    If felineX(A) < felineY(B), then A cannot reach B.
    If felineY(A) < felineY(B), then A cannot reach B.
I presume that the first line is a typo and you compare the same X index?

Yes, sorry for the typos. I fixed them in the report on GitHub.

* **Compatible?** In our test implementation, we use a previously unused
   byte of data in the commit-graph format to indicate which reachability
   index version we are using. Existing clients ignore this value, so we
   will want to consider if these new indexes are _backwards compatible_.
   That is, will they still report correct values if they ignore this byte
   and use the generation number column from the commit-graph file assuming
   the values are minimum generation numbers?
I personally consider that the current commit-graph with generation
numbers experimental, so I am not sure how much we care about this.

Having said that.

By the above definition, any new index that is wider than the
current generation number cannot be compatible (can we even tell the
existing clients how wide each elements in the ix array is?)

In any case, perhaps the first thing to do is to update the clients
so that they stop ignoring the version number field, and instead
work without generation number when there is no version of reach.ix
available in the file?  That way, a better reachablility index can
be chosen freely without having to worry about the compatibility.

I can work on that. It should be as simple as setting commit->generation to
GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
has a different version.

* **Immutable?** Git objects are _immutable_. If you change an object you
   actually create a new object with a new object ID. Are the values we
store
   for these reachability indexes also immutable?
Even if we do not embed the reachability ix in commit objects,
having an immutable value is probably a must if we want to make them
incrementally computable, so this is a very good property to have.
Unless there is a clever idea to incrementally compute a mutable
reach.ix, my gut instinct says that this property is a must.

Another thing, perhaps related to "local" below, is if exactly the
same reach.ix is computed by anybody, given an identical commit
history graph (perhaps "reproducibility"?).  I think most of the
candidates you listed are reproducible without a fixed tie-breaker,
but I am not sure about felineY() thing.

* **Local?** Are these values **locally computable**? That is, do we only
   need to look at the parents of a commit (assuming those parents have
   computed values) in order to determine the value at that commit?
A subset of non-local reachability ix, for example, the ones that
need to know what each commit's children are, cannot be immutable,
as adding new objects to the graph (either with locally committing,
or transferring objects from other repositories) would affect the
ix; is this true for all non-local reachability ix, I wonder?

As a thought experiment, we could define a function size(C) to be the
numberof commits reachable from C. This is not locally-computable
from the size values at C's parents due to the inclusion-exclusion
principle. We would need to compute it by walking the reachable set
and counting (resulting in quadratic performance overall) but is
immutable. Since the performance cost is so expensive (unlike the
linear costs in the other non-local versions) I didn't include it
in my comparison.

We focused on three types of performance tests that test the indexes
in different ways. Each test lists the `git` command that is used,
and the table lists which repository is used and which inputs.

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.
A devil's advocate comment.  Your patches seem to be very focused on
this "unlimited" case for the past few weeks, but I am not so sure
if that is a case worth optimizing for.  If "git log --topo-order -N
HEAD~M.." (for some number M) gives us a similar result as unlimited
case but with much less effort, wouldn't it be good enough that lets
us concentrate on other use cases instead?

I mostly included these statistics to make sure we didn't add a 
regression in
this case. Note that I didn't report the OLD values in this table, 
because that
would be an unfair comparison.

Based on the performance results alone, we should remove minimum
generation numbers, (epoch, date) pairs, and FELINE index from
consideration. There are enough examples of these indexes performing
poorly.
OK.

In contrast, maximum generation numbers and corrected commit
dates both performed quite well. They are frequently the top
two performing indexes, and rarely significantly different.

The trade-off here now seems to be: which _property_ is more important,
locally-computable or backwards-compatible?
Nice summary.

As I already said, I personally do not think being compatible with
currently deployed clients is important at all (primarily because I
still consider the whole thing experimental), and there is a clear
way forward once we correct the mistake of not having a version
number in the file format that tells the updated clients to ignore
the generation numbers.  For longer term viability, we should pick
something that is immutable, reproducible, computable with minimum
input---all of which would lead to being incrementally computable, I
would think.

This is good reasoning. The "reproducible" property is also important for
support reasons, too! Sounds like the corrected commit date is the best
way forward.

Aside: I spent some time thinking about making the corrected commit dates
backward compatible by ensuring the offsets are monotonic in the commit
history (so we could store the offset as commit->generation and the existing
generation comparisons would still work). However, it performs poorly on the
Linux repository 'git merge-base v4.8 v4.9' example.

Thanks,
-Stolee