Derrick Stolee <stolee@xxxxxxxxx> writes: > On 10/29/2018 11:59 PM, Junio C Hamano wrote: >> Derrick Stolee <stolee@xxxxxxxxx> writes: [...] >>> * **Compatible?** In our test implementation, we use a previously unused >>> byte of data in the commit-graph format to indicate which reachability >>> index version we are using. Existing clients ignore this value, so we >>> will want to consider if these new indexes are _backwards compatible_. >>> That is, will they still report correct values if they ignore this byte >>> and use the generation number column from the commit-graph file assuming >>> the values are minimum generation numbers? >> >> I personally consider that the current commit-graph with generation >> numbers experimental, so I am not sure how much we care about this. >> >> Having said that. >> >> By the above definition, any new index that is wider than the >> current generation number cannot be compatible (can we even tell the >> existing clients how wide each elements in the ix array is?) >> >> In any case, perhaps the first thing to do is to update the clients >> so that they stop ignoring the version number field, and instead >> work without generation number when there is no version of reach.ix >> available in the file? That way, a better reachablility index can >> be chosen freely without having to worry about the compatibility. > > I can work on that. It should be as simple as setting commit->generation to > GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph > has a different version. That is a very good idea, but we have here a chicken-and-egg problem. The only way to denote that the generation numbers / reachability index have changed (for reading/using: that it changed in backwards incompatibile way; for update: that it changed at all) is to change the format version: 1-byte version number: Currently, the only valid version is 1. But if we assume that different commit-graph format version simply means that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we block possible changes to the structure of the commit-graph file. [Reads further in thread] Ah, I see that's why you want to introduce [1]: DS> DS> 3. Reachability Index versioning [1]: https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50795@xxxxxxxxx/ >>> * **Immutable?** Git objects are _immutable_. If you change an object you >>> actually create a new object with a new object ID. Are the values we store >>> for these reachability indexes also immutable? >> >> Even if we do not embed the reachability ix in commit objects, >> having an immutable value is probably a must if we want to make them >> incrementally computable, so this is a very good property to have. >> Unless there is a clever idea to incrementally compute a mutable >> reach.ix, my gut instinct says that this property is a must. I think that the reachability index can be mutable and non-local, but still being able to be incrementally updated, though perhaps it would require not only calculations for new commits, but recalculations for some limited subset of commits existing in the commit-graph file. I'm trying to modify exact definition of maximal generation numbers to check if I can find out a version that exhibits nearly-incremental updates property. >> Another thing, perhaps related to "local" below, is if exactly the >> same reach.ix is computed by anybody, given an identical commit >> history graph (perhaps "reproducibility"?). I think most of the >> candidates you listed are reproducible without a fixed tie-breaker, >> but I am not sure about felineY() thing. felineY() is deterministic (or can be made deterministic) for given felineX() (i.e. given [reverse] topological order). >>> * **Local?** Are these values **locally computable**? That is, do we only >>> need to look at the parents of a commit (assuming those parents have >>> computed values) in order to determine the value at that commit? >> >> A subset of non-local reachability ix, for example, the ones that >> need to know what each commit's children are, cannot be immutable, >> as adding new objects to the graph (either with locally committing, >> or transferring objects from other repositories) would affect the >> ix; is this true for all non-local reachability ix, I wonder? > > As a thought experiment, we could define a function size(C) to be the > number of commits reachable from C. Let's call it reach(C), instead ;-) > This is not locally-computable > from the size values at C's parents due to the inclusion-exclusion > principle. We would need to compute it by walking the reachable set > and counting (resulting in quadratic performance overall) but is > immutable. Since the performance cost is so expensive (unlike the > linear costs in the other non-local versions) I didn't include it > in my comparison. Let's define Reach(C) as set of all commits reachable from commit C. Then reach(C) = ||Reach(C)||, where ||S|| is number of elements in the set S. If commit A can reach commit B, then B ∈ Reach(A), but also Reach(B) ⊂ Reach(A), thus reach(B) < reach(A). Therefore if reach(A) <= reach(B) then A cannot reach commit B -- reach(C) can be used as reachability index. However the performance cost of calculating reach(C), i.e. full traversal of the commit graph without ability for incremental update (well, you can do incremental update starting from the commits that have reachability bitmap [2]) is prohibitive. Never mind the fact that this index would have, I think, the same performance problems as (minimum) generaton numbers. [2]: https://githubengineering.com/counting-objects/ SIDENOTE 1: as gen(C) can be thought of as the lower boundary on reach(C), similarly sumgen(C) defined as being 1 for parent-less commits, and being (sum_{P ∈ parents(C)} sumgen(P)) + 1 otherwise being the upper boundary on reach(C). This sumgen(C) can also be used as generation number / reachability index. gen(C) <= reach(C) <= sumgen(C) SIDENOTE 2: The idea of using a proxy for Reach(B) ⊆ Reach(A) as a negative-cut filter (negative-cut reachability index) is used in two types of reachability indices: one is Independent Permutation Labelling (IP+) [3] approach, the other is Bloom Filter Labelling (BFL) [4]. Both algorithms are local, and with given permutation (which can be replaced by object ID, I think) for IP+ and given set of hash functions etc. for BFL they are both immutable. [3] Hao Wei, Jeffrey Xu Yu, Can Lu, Ruoming Jin "Reachability Querying: An Independent Permutation Labeling Approach", Proceedings of the VLDB Endowment, Vol. 7, No. 12 (2014) [4] Jiao Su, Qing Zhu, Hao Wei, Jeffrey Xu Yu "Reachability Querying: Can It Be Even Faster?", IEEE Transactions on Knowledge and Data Engineering (2016) Best, -- Jakub Narębski