On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs; in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).

I posted two alternative patches in May of 2011. The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch
of copies of the same tags, one per fork). The second switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation could be used in
     other spots, too, it ended up being a pain to use, as most of the
     callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

though it probably doesn't apply cleanly anymore. However, I've kept
it rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build
on top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it once at the end.
Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have René's nice linked-list
mergesort that backs commit_list_sort_by_date.)

> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K
refs were 1-2 minutes. Does the patch above fix it for you?

-Peff