On Tue, Mar 17, 2020 at 3:18 PM Jakub Narebski <jnareb@xxxxxxxxx> wrote: > > Christian Couder <christian.couder@xxxxxxxxx> writes: > > On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@xxxxxxxxx> wrote: > >>>> If for each commit 'v' we would compute and store in the commit-graph > >>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits > >>>> that were visited during the part of depth-first search that started > >>>> from 'v' (which is the minimum of post-order number for subtree of a > >>>> spanning tree that starts at 'v'). Let's call the later 'min_tree(v)'. > >>>> Then the following condition is true: > >>>> > >>>> if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u' > >>> > >>> How many places in Git do we ask "can v reach u?" and how many would > >>> return immediately without needing a walk in this new approach? My > >>> guess is that we will have a very narrow window where this query > >>> returns a positive result. > >> > >> As I wrote below, such positive-cut filter would be directly helpful in > >> performing the following commands: > >> > >> - `git merge-base --is-ancestor` > >> - `git branch --contains` > >> - `git tag --contains` > >> - `git branch --merged` > >> - `git tag --merged` > >> > >> It would be also useful for tag autofollow in git-fetch; is is N-to-M > >> equivalent to 1-to-N / N-to-1 `--contains` queries. > >> > >> I am quite sure that positive-cut filter would make `--ancestry-path` > >> walk faster. > >> > >> I think, but I am not sure, that positive-cut filter can make parts of > >> topological sort and merge base algorithms at least a tiny bit faster. > > > > Is there an easy way to check that it would provide significant > > performance improvements at least in some cases? Can we ask the > > student to do that at the beginning of the GSoC? > > The "Reachability labels for version control graphs.ipynb" Jupyter > Notebook on Google Colaboratory was created to answer this question > (originally for the FELINE reachability index). Among others it can > min-post interval labels and topological levels (generation numbers), > use them for reachability queries, and load Linux kernel > commit-graph. The exploration didn't get finished, but it would be not > that difficult, I think, to at least find the amount of false negatives > for min-post interval labeling for git.git or Linux kernel repo. > > https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg > > As Jupyter Notebook, it is run in the web browser. It can either use > local runtime, or run on Google Cloud runtime. On the other hand it > requires at least some knowledge of Python... Ok, if that's a possible approach to the project, then please add it to the description. > >> I think it would be possible to compute post(v) and min_tree(v) using > >> incremental updates, and to make it compatibile with incremental > >> commit-graph format (with the commit-graph chain). But I have not > >> proven it. > > > > Would it be difficult to prove? What would be required? And again can > > we ask the student to do that at the beginning of the GSoC? > > Formal mathematical proof by induction would be not that difficult. The > problem is, I think, in finding all possible classes of initial spanning > forest arrangements, and all possible classes of commit-graph growth by > one commit -- and examining how this affect the spanning forest. Ok. > >>> The point of generation number v2 [1] was to allow moving to "exact" > >>> algorithms for things like merge-base where we still use commit time > >>> as a heuristic, and could be wrong because of special data shapes. > >>> We don't use generation number in these examples because using only > >>> generation number can lead to a large increase in number of commits > >>> walked. The example we saw in the Linux kernel repository was a bug > >>> fix created on top of a very old commit, so there was a commit of > >>> low generation with very high commit-date that caused extra walking. > >>> (See [2] for a detailed description of the data shape.) > >>> > >>> My _prediction_ is that the two-dimensional system will be more > >>> complicated to write and use, and will not have any measurable > >>> difference. I'd be happy to be wrong, but I also would not send > >>> anyone down this direction only to find out I'm right and that > >>> effort was wasted. > >> > >> That might be a problem. > >> > >> This is a bit of a "moonshot" / research project, moreso than others. > >> Though it would be still valuable, in my opionion, even if the code > >> wouldn't ultimately get merged and added into Git. > > > > I agree that it feels like a "moonshot" / research project. > > > >>> My recommendation is that a GSoC student update the > >>> generation number to "v2" based on the definition you made in [1]. > >>> That proposal is also more likely to be effective in Git because > >>> it makes use of extra heuristic information (commit date) to > >>> assist the types of algorithms we care about. > >>> > >>> In that case, the "difficult" part is moving the "generation" > >>> member of struct commit into a slab before making it a 64-bit > >>> value. (This is likely necessary for your plan, anyway.) Updating > >>> the generation number to v2 is relatively straight-forward after > >>> that, as someone can follow all places that reference or compute > >>> generation numbers and apply a diff > >> > >> Good idea! Though I am not sure if it is not too late to add it to the > >> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of > >> March 16 (where students can start submitting proposals to GSoC) has > >> just passed. Christian, what do you think? > > > > Would that be a different project idea or part of your "Graph labeling > > for speeding up git commands" project idea? > > > > I am very reluctant to add new project ideas at that time. I don't > > think student will have time to properly research it and get it > > reviewed. > > > > It could be part of your research project though, to check if that > > approach is better or good enough compared to what you suggest in the > > current version of your project. > > > >> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for > >> "Generation number v2" project? > > > > At this point I think it might be best if you are both willing to > > co-mentor a "moonshot" / research project to find what is the best way > > forward by bench-marking the different approaches that you both > > suggest for different commands/use cases. > > So perhaps we should expand "Commit graph labeling for speeding up git > commands" idea, splitting it into two possible ways of working in this > project: the more technical 'Generation number v2', and 'Interval labels > for commit graph' which is more of research project? Ok for that. > Which should be > put first, then? Which ever you prefer. If you say that any part could be done separately, that doesn't matter much. Best, Christian.