Re: [PATCH 3/3] revision: insert unsorted, then sort in prepare_revision_walk()

Jeff King <peff@xxxxxxxx> · Mon, 2 Apr 2012 16:14:33 -0400

On Sun, Apr 01, 2012 at 12:11:01AM +0200, René Scharfe wrote:

> Speed up prepare_revision_walk() by adding commits without sorting
> to the commit_list and at the end sort the list in one go.  Thanks
> to mergesort() working behind the scenes, this is a lot faster for
> large numbers of commits than the current insert sort.

I think this is probably a sane thing to do, but I have two slight
misgivings:

  1. Is it worth the complexity of the linked-list mergesort? I was
     planning to just build an array, qsort it, and then put the results
     into a linked list. The patch for that is below for reference.

     It's a lot less code and complexity for the same performance
     (actually, I measured it at 1% faster, but that is probably
     negligible). The downside is that it is not nicely encapsulated in
     commit_list_sort_by_date(). We call the latter from two other
     places; I don't know if they can be fed with enough commits to
     actually benefit from the performance gain or not.

  2. I'm not super happy about fixing this one spot. This quadratic
     behavior comes up in a lot of places, and we're slowly hacking them
     one by one. E.g., this does nothing to help the same case in
     fetch-pack.c:mark_complete[1]. Nor does it help the fact that
     when we follow parents, we will do an O(n) insert_by_date for each
     commit we insert. The latter is largely saved by the locality of
     timestamps (i.e., timestamps of the parents of recently popped
     commits tend to be near the front of the list), as well as the hack
     in fce87ae (Fix quadratic performance in rewrite_one., 2008-07-12).

     So I wonder if in the long term we would benefit from a better data
     structure, which would make these problems just go away. That being
     said, there is a lot of code to be updated with such a change, so
     even if we do want to do that eventually, a quick fix like this is
     probably still a good thing.

-Peff

[1] I fixed the mark_complete thing in ea5f220 (fetch: avoid repeated
    commits in mark_complete, 2011-05-19), but only for exact-duplicate
    commits. The real-world case where it came up was an "alternates"
    repository that held refs for many clones (so we had hundreds or
    thousands of copies of each tag). But on a repository like the one
    we are testing on, I think it would be similarly slow.

---
Here's the qsort-in-array patch, for reference.

diff --git a/revision.c b/revision.c
index b3554ed..22c26d0 100644
--- a/revision.c
+++ b/revision.c
@@ -2062,10 +2062,24 @@ static void set_children(struct rev_info *revs)
 	}
 }
 
+static int commit_compare_by_date(const void *va, const void *vb)
+{
+	const struct commit *a = va;
+	const struct commit *b = vb;
+	if (a->date < b->date)
+		return -1;
+	if (b->date < a->date)
+		return 1;
+	return 0;
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int nr = revs->pending.nr;
 	struct object_array_entry *e, *list;
+	struct commit **commits = NULL;
+	int commits_nr = 0, commits_alloc = 0;
+	int i;
 
 	e = list = revs->pending.objects;
 	revs->pending.nr = 0;
@@ -2076,11 +2090,17 @@ int prepare_revision_walk(struct rev_info *revs)
 		if (commit) {
 			if (!(commit->object.flags & SEEN)) {
 				commit->object.flags |= SEEN;
-				commit_list_insert_by_date(commit, &revs->commits);
+				ALLOC_GROW(commits, commits_nr + 1, commits_alloc);
+				commits[commits_nr++] = commit;
 			}
 		}
 		e++;
 	}
+	qsort(commits, commits_nr, sizeof(*commits), commit_compare_by_date);
+	for (i = commits_nr - 1; i >= 0; i--)
+		commit_list_insert(commits[i], &revs->commits);
+	free(commits);
+
 	if (!revs->leak_pending)
 		free(list);
 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html