On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs; in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).

I posted two alternative patches in May of 2011. The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch
of copies of the same tags, one per fork). The second switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation could be used in
     other spots, too, it ended up being a pain to use, as most of the
     callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

though it probably doesn't apply cleanly anymore. However, I've kept
it rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build
on top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it once at the end.
Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have René's nice linked-list
mergesort that backs commit_list_sort_by_date.)

> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K
refs were 1-2 minutes. Does the patch above fix it for you?

-Peff