Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 21, 2012 at 10:13:33AM +0200, Michael Haggerty wrote:

> I just noticed that the remove_duplicates() function in
> builtin/fetch-pack.c is O(N^2) in the number of heads.  Empirically,
> this function takes on the order of 25 seconds to process 100k
> references.
> 
> I know that 100k heads is kindof absurd.  Perhaps handling this many
> heads is unrealistic for other reasons.  But I vaguely recall numbers
> like this being mentioned on the mailing list.

The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.

I've never triggered this one, though, because it relies not just on
having a repo with a lot of refs, but actually fetching all of them at
one time (which we don't tend to do).

But it would be nice to fix it. There is a similar case in filter_refs,
which I mentioned here:

  http://article.gmane.org/gmane.comp.version-control.git/186994

> It would be pretty trivial to reduce the work to O(N) by using a hash
> set to keep track of the references that have already been seen.

I don't think there is any reason we can't sort the list of heads, and
then we can get rid of the duplicates with an O(n) traversal, like the
(largely untested) patch below.

> I don't plan to work on this, but I thought I would point it out in
> case it is causing somebody pain.

I'll clean up the patch and make one for the filter_refs case, too.

-Peff

---
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index b6cc75e..7efcf2f 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -859,25 +859,23 @@ static struct ref *do_fetch_pack(int fd[2],
 	return ref;
 }
 
+static int compare_heads(const void *a, const void *b)
+{
+	return strcmp(*(const char **)a, *(const char **)b);
+}
+
 static int remove_duplicates(int nr_heads, char **heads)
 {
 	int src, dst;
 
-	for (src = dst = 0; src < nr_heads; src++) {
-		/* If heads[src] is different from any of
-		 * heads[0..dst], push it in.
-		 */
-		int i;
-		for (i = 0; i < dst; i++) {
-			if (!strcmp(heads[i], heads[src]))
-				break;
-		}
-		if (i < dst)
-			continue;
-		if (src != dst)
-			heads[dst] = heads[src];
-		dst++;
-	}
+	if (!nr_heads)
+		return 0;
+
+	qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
+	for (src = dst = 1; src < nr_heads; src++)
+		if (strcmp(heads[src], heads[dst-1]))
+			heads[dst++] = heads[src];
 	return dst;
 }
 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]