Derrick Stolee <stolee@xxxxxxxxx> writes: > On 4/7/2018 12:55 PM, Jakub Narebski wrote: >> Currently I am at the stage of reproducing results in FELINE paper: >> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search >> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed >> J. Zaki (2014). This paper is available in the PDF form at >> https://openproceedings.org/EDBT/2014/paper_166.pdf >> >> The Jupyter Notebook (which runs on Google cloud, but can be also run >> locally) uses Python kernel, NetworkX librabry for graph manipulation, >> and matplotlib (via NetworkX) for display. >> >> Available at: >> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg >> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing >> >> I hope that could be of help, or at least interesting > > Let me know when you can give numbers (either raw performance or # of > commits walked) for real-world Git commit graphs. The Linux repo is a > good example to use for benchmarking, but I also use the Kotlin repo > sometimes as it has over a million objects and over 250K commits. As I am curently converting git repository into commit graph, number of objects doesn't matter. Though Kotlin is nicely in largish size set, not as large as Linux kernel which has 750K commits, but mich larger than git.git with 65K commits. > Of course, the only important statistic at the end of the day is the > end-to-end time of a 'git ...' command. Your investigations should > inform whether it is worth prototyping the feature in the git > codebase. What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. I have tried finding number of false positives for level (generation number) filter and for FELINE index, and number of false negatives for min-post intervals in the spanning tree (for DFS tree) for 10000 randomly selected pairs of commits... but I don't think this is a good benchmark. I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) that has 750832 nodes and 811733 edges, and 563747941392 possible directed pairs, we have for 10000 randomly selected pairs of commits: level-filter has 91 = 0.91% [all] false positives FELINE index has 78 = 0.78% [all] false positives FELINE index has 1.16667 less false positives than level filter min-post spanning-tree intervals has 3641 = 36.41% [all] false negatives For git.git repository (https://github.com/git/git.git) that has 52950 nodes and 65887 edges the numbers are slighly more in FELINE index favor (also out of 10000 random pairs): level-filter has 504 = 9.11% false positives FELINE index has 125 = 2.26% false positives FELINE index has 4.032 less false positives than level filter This is for FELINE which does not use level / generatio-numbers filter. Regards, -- Jakub Narębski