On Thu, Jun 07 2018, Derrick Stolee wrote:

> To test the performance in this situation, I created a
> script that organizes the Linux repository in a similar
> fashion. I split the commit history into 50 parts by
> creating branches on every 10,000 commits of the first-
> parent history. Then, `git rev-list --objects A ^B`
> provides the list of objects reachable from A but not B,
> so I could send that to `git pack-objects` to create
> these "time-based" packfiles. With these 50 packfiles
> (deleting the old one from my fresh clone, and deleting
> all tags as they were no longer on-disk) I could then
> test 'git rev-list --objects HEAD^{tree}' and see:
>
>         Before: 0.17s
>         After:  0.13s
>         % Diff: -23.5%
>
> By adding logic to count hits and misses to bsearch_pack,
> I was able to see that the command above calls that
> method 266,930 times with a hit rate of 33%. The MIDX
> has the same number of calls with a 100% hit rate.

Do you have the script you used for this? It would be very interesting
as something we could stick in t/perf/ to test this use-case in the
future.

How does this & the numbers below compare to just a naïve
--max-pack-size=<similar size> on linux.git?

Is it possible for you to tar this test repo up and share it as a
one-off? I've been polishing the core.validateAbbrev series I have, and
it would be interesting to compare some of the (abbrev) numbers.

> Abbreviation Speedups
> ---------------------
>
> To fully disambiguate an abbreviation, we must iterate
> through all packfiles to ensure no collision exists in
> any packfile. This requires O(P log N) time. With the
> MIDX, this is only O(log N) time. Our standard test [2]
> is 'git log --oneline --parents --raw' because it writes
> many abbreviations while also doing a lot of other work
> (walking commits and trees to compute the raw diff).
>
> For a copy of the Linux repository with 50 packfiles
> split by time, we observed the following:
>
>         Before: 100.5 s
>         After:   58.2 s
>         % Diff: -59.7%
>
>
> Request for Review Attention
> ----------------------------
>
> I tried my best to take the feedback from the commit-graph
> feature and apply it to this feature. I also worked to
> follow the object-store refactoring as I could. I also have
> some local commits that create a 'verify' subcommand and
> integrate with 'fsck' similar to the commit-graph, but I'll
> leave those for a later series (and review is still underway
> for that part of the commit-graph).
>
> One place where I could use some guidance is related to the
> current state of 'the_hash_algo' patches. The file format
> allows a different "hash version" which then indicates the
> length of the hash. What's the best way to ensure this
> feature doesn't cause extra pain in the hash-agnostic series?
> This will inform how I go back and make the commit-graph
> feature better in this area, too.
>
>
> Thanks,
> -Stolee
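
For reference, the lookup-cost argument in the quoted text (searching
every pack index versus searching one multi-pack index) can be
illustrated with a toy model. This is only a sketch under stated
assumptions: object IDs are plain hex strings, each "pack index" is a
sorted Python list, and the names lookup_in_packs, build_midx and
lookup_in_midx are hypothetical, not functions in git.

```python
from bisect import bisect_left


def lookup_in_packs(packs, oid):
    """Binary-search each pack index in turn.

    Returns (found, searches), where searches is the number of
    per-pack binary searches performed: O(P log N) in the worst case.
    """
    searches = 0
    for index in packs:
        searches += 1
        i = bisect_left(index, oid)
        if i < len(index) and index[i] == oid:
            return True, searches
    return False, searches


def build_midx(packs):
    """Merge all pack indexes into one sorted, de-duplicated list."""
    return sorted(set().union(*packs))


def lookup_in_midx(midx, oid):
    """One binary search over the merged index: O(log N)."""
    i = bisect_left(midx, oid)
    return (i < len(midx) and midx[i] == oid), 1


# Toy data (an assumption, not the linux.git layout): 50 packs with
# 100 objects each; object n is stored in pack n % 50.
packs = [[f"{n:040x}" for n in range(p, 5000, 50)] for p in range(50)]
midx = build_midx(packs)

in_last_pack = f"{49:040x}"    # present only in the 50th pack
missing = f"{5000:040x}"       # present in no pack
```

With these 50 toy packs, an object that lives only in the last pack
costs 50 binary searches in the per-pack model and one in the merged
model, and a missing object costs the same 50 versus one.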