Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes: >> Having such a complicated two-dimensional system would need to >> justify itself by being measurably faster than that one-dimensional >> system in these example commands. >> >> [...] >> >> My _prediction_ is that the two-dimensional system will be more >> complicated to write and use, and will not have any measurable >> difference. I'd be happy to be wrong, but I also would not send >> anyone down this direction only to find out I'm right and that >> effort was wasted. > > Agreed. I have been through the papers of the involved variants and on graphs > comparable to some of the largest git repositories, the performance improves by > fifty nanoseconds for a random query. I would recommend extending results for other types of large graphs to the commit graphs with care. The characteristics of those graphs are quite different from characteristics of commit graph: they usually are scale-free graphs, with low maximum level, and low connectivity: the probability of two random nodes being connected in order of 10^-3 or 10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99 The last one, called R-ratio, means that testing on random query actually tests mainly negative-cut filters. That is why some papers provide either separate numbers for negative and for positive queries, or separate numbers for random and for balanced queries. > Additionally: > 1. They require significantly more space per commit. This depends on the type of the algorithm: is it Label-Only (answering reachability queries without consulting graph), or Label+Graph (augmented online search algorithms): https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78 The CDAT chunk in current version of commit-graph format takes H+16 bytes per commit (where H is the size of object id hash). From those H+16 bytes 30 bits (slightly less that 4 bytes) are used for current reachability label: the topological level aka generation number. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45 The proposed min-post interval label would take 8 bytes per commit, that is 4 bytes per single number in interval. That is not much, provided that we get visible performance improvements for at least some often used git commands. > 2. They require significantly more preprocessing time. This again depends on the type of algorithm: Label-Only or Label+G. In the case of min-post interval labels, they can be computed together with generation number, during the same commit-graph walk. The amount of calculations required to compute min-post interval is not much. Therefore I think it would be not unreasonable cost. >> My recommendation is that a GSoC student update the >> generation number to "v2" based on the definition you made in [1]. >> That proposal is also more likely to be effective in Git because >> it makes use of extra heuristic information (commit date) to >> assist the types of algorithms we care about. > >> In that case, the "difficult" part is moving the "generation" >> member of struct commit into a slab before making it a 64-bit >> value. (This is likely necessary for your plan, anyway.) Updating >> the generation number to v2 is relatively straight-forward after >> that, as someone can follow all places that reference or compute >> generation numbers and apply a diff > > Thanks for the recommendation. Reading about how this fits in more > with REU on the other thread, I too agree that updating generation > number to use corrected commit date would be more appropriate for a GSoC > project. You can read about monotonically offset corrected commit date for example on my slides, from page 64 (the problem) to 75 (references): https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64 I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps tomorrow when I will have more free time. https://git.github.io/SoC-2020-Ideas/ In the meantime I will refer to what is written at the top of it: > **Students:** Please consider these ideas as starting points for > generating proposals. We are also more than happy to receive proposals > for other ideas related to Git. Make sure you have read the “Note > about refactoring projects versus projects that implement new > features” in the general application information page though. Best, -- Jakub Narębski