On Thu, May 19, 2011 at 04:48:51PM -0400, Jeff King wrote:

> By noting commits we have already added to the list, we can
> shrink the size of "n" in such a repo to the number of
> unique commits, which is on the order of what a normal repo
> would contain (it's actually more than a normal repo, since child repos
> may have branches at different states, but in practice it tends
> to be much smaller than the list with duplicates).
>
> Here are the results on one particular giant repo
> (containing objects for all Rails forks on GitHub):
>
>   $ git for-each-ref | wc -l
>   112514

So this seems to work pretty well for us in practice, because our
giant-refs use case is a big alternates repo, and of those ~112K refs,
there are only ~5K or so unique ones. Which is small enough that in
this timing:

> [before]
> $ git fetch --no-tags ../remote.git
> 63.52user 0.12system 1:03.68elapsed 99%CPU (0avgtext+0avgdata 137648maxresident)k
> 1856inputs+48outputs (11major+19603minor)pagefaults 0swaps
>
> $ git fetch --no-tags ../remote.git
> 6.15user 0.08system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123856maxresident)k
> 0inputs+40outputs (0major+18872minor)pagefaults 0swaps

The 6 seconds are not spent doing anything related to mark_complete, so
it's effectively fast enough (i.e., we made "n" small enough that the
O(n^2) behavior doesn't matter).

But obviously there's still a worst case where you get bad behavior if
you have a lot of unique refs. I don't know if anybody has that
situation in practice (it might be something like creating a ref for
every commit). So I also considered a different approach, which is to
use a better data structure to hold the sorted commit list (a priority
queue). That gives a similar speedup:

  $ git fetch --no-tags ../remote.git
  6.20user 0.03system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123712maxresident)k
  0inputs+40outputs (0major+18885minor)pagefaults 0swaps

and actually has good asymptotic complexity.

I also had a feeling we could use a faster commit_list in other places,
since get_revision is all based around the same pop-and-push-the-parents
workflow. But after spending a few hours on it, it seems that _lots_ of
places in the code work on the assumption of a linked list (or at least
doing very simple in-order traversal), and the changes got very ugly
very quickly. And performance-wise, it doesn't seem to be a huge
problem, mainly because:

  1. We don't tend to put more items in a commit_list than the width of
     the history graph. So our "n" tends to be in the dozens at most.

  2. We have the hack in fce87ae (Fix quadratic performance in
     rewrite_one., 2008-07-12) which works OK in practice.

So it's a lot of extra code, and I haven't found a case where it
actually improves performance. But here's that series, just for
reference. We may want to take 1/4, which is an unrelated cleanup.

  [1/4]: Makefile: sort TEST_PROGRAMS list
  [2/4]: basic priority queue implementation
  [3/4]: commit: add infrastructure for priority queues of commits
  [4/4]: fetch-pack: use priority queue for mark_complete

-Peff
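To make the data-structure idea above concrete, here is a minimal,
standalone sketch of a binary max-heap keyed by commit date, the kind
of priority queue that could replace the sorted commit_list insertion
in mark_complete. The names (struct commit_queue, queue_put, queue_get)
are invented for illustration and are not the API from the patches in
this series:

  /*
   * Sketch only: a binary max-heap keyed by a commit timestamp.
   * Names and types are invented for illustration and do not reflect
   * the actual patches; error handling is kept minimal.
   */
  #include <stdlib.h>

  struct queue_entry {
          unsigned long date;     /* stands in for commit->date */
          void *item;             /* stands in for a struct commit pointer */
  };

  struct commit_queue {
          struct queue_entry *entries;
          size_t nr, alloc;
  };

  static void queue_swap(struct commit_queue *q, size_t a, size_t b)
  {
          struct queue_entry tmp = q->entries[a];
          q->entries[a] = q->entries[b];
          q->entries[b] = tmp;
  }

  /* O(log n) insert: append at the end, then sift up toward the root */
  static void queue_put(struct commit_queue *q, unsigned long date, void *item)
  {
          size_t i;

          if (q->nr == q->alloc) {
                  q->alloc = q->alloc ? 2 * q->alloc : 64;
                  q->entries = realloc(q->entries, q->alloc * sizeof(*q->entries));
                  if (!q->entries)
                          abort();
          }
          i = q->nr++;
          q->entries[i].date = date;
          q->entries[i].item = item;
          while (i) {
                  size_t parent = (i - 1) / 2;
                  if (q->entries[parent].date >= q->entries[i].date)
                          break;
                  queue_swap(q, parent, i);
                  i = parent;
          }
  }

  /*
   * O(log n) extraction of the newest entry: move the last entry to
   * the root and sift it back down.
   */
  static void *queue_get(struct commit_queue *q)
  {
          void *result;
          size_t i = 0;

          if (!q->nr)
                  return NULL;
          result = q->entries[0].item;
          q->entries[0] = q->entries[--q->nr];
          for (;;) {
                  size_t child = 2 * i + 1;
                  if (child >= q->nr)
                          break;
                  if (child + 1 < q->nr &&
                      q->entries[child + 1].date > q->entries[child].date)
                          child++;
                  if (q->entries[i].date >= q->entries[child].date)
                          break;
                  queue_swap(q, i, child);
                  i = child;
          }
          return result;
  }

With something along these lines, adding a commit costs O(log n)
instead of the O(n) sorted-list insertion, and the usual
pop-and-push-the-parents loop would call queue_get() to take commits
newest-first, giving O(n log n) overall rather than the O(n^2)
behavior described above.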