Re: [RFC] Generation Number v2

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 02 Nov 2018 14:33:03 +0100

Derrick Stolee <stolee@xxxxxxxxx> writes:
> On 10/29/2018 11:59 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
[...]
>>> * **Compatible?** In our test implementation, we use a previously unused
>>>    byte of data in the commit-graph format to indicate which reachability
>>>    index version we are using. Existing clients ignore this value, so we
>>>    will want to consider if these new indexes are _backwards compatible_.
>>>    That is, will they still report correct values if they ignore this byte
>>>    and use the generation number column from the commit-graph file assuming
>>>    the values are minimum generation numbers?
>>
>> I personally consider that the current commit-graph with generation
>> numbers experimental, so I am not sure how much we care about this.
>>
>> Having said that.
>>
>> By the above definition, any new index that is wider than the
>> current generation number cannot be compatible (can we even tell the
>> existing clients how wide each elements in the ix array is?)
>>
>> In any case, perhaps the first thing to do is to update the clients
>> so that they stop ignoring the version number field, and instead
>> work without generation number when there is no version of reach.ix
>> available in the file?  That way, a better reachablility index can
>> be chosen freely without having to worry about the compatibility.
>
> I can work on that. It should be as simple as setting commit->generation to
> GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
> has a different version.

That is a very good idea, but we have here a chicken-and-egg problem.
The only way to denote that the generation numbers / reachability index
have changed (for reading/using: that it changed in backwards
incompatibile way; for update: that it changed at all) is to change the
format version:

  1-byte version number:
      Currently, the only valid version is 1.

But if we assume that different commit-graph format version simply means
that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we
block possible changes to the structure of the commit-graph file.

[Reads further in thread]

Ah, I see that's why you want to introduce [1]:

DS>
DS> 3. Reachability Index versioning

[1]: https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50795@xxxxxxxxx/

>>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>>    actually create a new object with a new object ID. Are the values we store
>>>    for these reachability indexes also immutable?
>>
>> Even if we do not embed the reachability ix in commit objects,
>> having an immutable value is probably a must if we want to make them
>> incrementally computable, so this is a very good property to have.
>> Unless there is a clever idea to incrementally compute a mutable
>> reach.ix, my gut instinct says that this property is a must.

I think that the reachability index can be mutable and non-local, but
still being able to be incrementally updated, though perhaps it would
require not only calculations for new commits, but recalculations for
some limited subset of commits existing in the commit-graph file.

I'm trying to modify exact definition of maximal generation numbers to
check if I can find out a version that exhibits nearly-incremental
updates property.

>> Another thing, perhaps related to "local" below, is if exactly the
>> same reach.ix is computed by anybody, given an identical commit
>> history graph (perhaps "reproducibility"?).  I think most of the
>> candidates you listed are reproducible without a fixed tie-breaker,
>> but I am not sure about felineY() thing.

felineY() is deterministic (or can be made deterministic) for given
felineX() (i.e. given [reverse] topological order).

>>> * **Local?** Are these values **locally computable**? That is, do we only
>>>    need to look at the parents of a commit (assuming those parents have
>>>    computed values) in order to determine the value at that commit?
>>
>> A subset of non-local reachability ix, for example, the ones that
>> need to know what each commit's children are, cannot be immutable,
>> as adding new objects to the graph (either with locally committing,
>> or transferring objects from other repositories) would affect the
>> ix; is this true for all non-local reachability ix, I wonder?
>
> As a thought experiment, we could define a function size(C) to be the
> number of commits reachable from C.

Let's call it reach(C), instead ;-)

>                                    This is not locally-computable
> from the size values at C's parents due to the inclusion-exclusion
> principle. We would need to compute it by walking the reachable set
> and counting (resulting in quadratic performance overall) but is
> immutable. Since the performance cost is so expensive (unlike the
> linear costs in the other non-local versions) I didn't include it
> in my comparison.

Let's define Reach(C) as set of all commits reachable from commit C.
Then reach(C) = ||Reach(C)||, where ||S|| is number of elements in the
set S.

If commit A can reach commit B, then B ∈ Reach(A), but also Reach(B) ⊂
Reach(A), thus reach(B) < reach(A).  Therefore if reach(A) <= reach(B)
then A cannot reach commit B -- reach(C) can be used as reachability
index.

However the performance cost of calculating reach(C), i.e. full
traversal of the commit graph without ability for incremental update
(well, you can do incremental update starting from the commits that have
reachability bitmap [2]) is prohibitive.  Never mind the fact that this
index would have, I think, the same performance problems as (minimum)
generaton numbers.

[2]: https://githubengineering.com/counting-objects/

SIDENOTE 1: as gen(C) can be thought of as the lower boundary on
reach(C), similarly sumgen(C) defined as being 1 for parent-less
commits, and being (sum_{P ∈ parents(C)} sumgen(P)) + 1 otherwise being
the upper boundary on reach(C).  This sumgen(C) can also be used as
generation number / reachability index.

   gen(C) <= reach(C) <= sumgen(C)

SIDENOTE 2: The idea of using a proxy for Reach(B) ⊆ Reach(A) as a
negative-cut filter (negative-cut reachability index) is used in two
types of reachability indices: one is Independent Permutation Labelling
(IP+) [3] approach, the other is Bloom Filter Labelling (BFL) [4].  Both
algorithms are local, and with given permutation (which can be replaced
by object ID, I think) for IP+ and given set of hash functions etc. for
BFL they are both immutable.

[3] Hao Wei, Jeffrey Xu Yu, Can Lu, Ruoming Jin "Reachability Querying:
    An Independent Permutation Labeling Approach", Proceedings of the
    VLDB Endowment, Vol. 7, No. 12 (2014)

[4] Jiao Su, Qing Zhu, Hao Wei, Jeffrey Xu Yu "Reachability Querying:
    Can It Be Even Faster?", IEEE Transactions on Knowledge and Data
    Engineering (2016)

Best,
-- 
Jakub Narębski