The old code called string_list_lookup(), and if that failed, called
string_list_insert(), thus doing the bisection search through the
string list twice in the latter code path. Instead, just call
string_list_insert() right away: if an entry for that peer reference
name already existed, its util pointer is already non-NULL, which is
how we detect duplicates.

Of course this doesn't change the fact that the repeated
string_list_insert() calls make the function scale like O(N^2) if the
input reference list is not already approximately sorted.

Signed-off-by: Michael Haggerty <mhagger@xxxxxxxxxxxx>
---
The O(N^2) algorithm mentioned above essentially boils down to

    for each reference:
        insert reference into string_list (sorted)

Because each insertion requires later array entries to be shifted to
the right, the algorithm scales like O(N^2) if the entries are not
already approximately in order.

I can think of three easy ways to fix this (a rough sketch of the
third approach follows at the end of these notes):

* Use a hash map instead of a sorted string_list to record which
  entries have already been seen.

* If the order of the result doesn't matter and it doesn't matter
  *which* of the duplicates is retained, then insert all of the names
  into the string_list unsorted, sort them all in one go, then iterate
  through looking for duplicates and stringing the survivors together
  into a new linked list.

* If the order is important, or it is important that the *first* of
  the duplicates be retained, then iterate through the linked list
  twice. The first time, just add the entries to the string_list.
  Then sort the string_list using a stable sort (e.g., git_qsort()).
  Then iterate through the linked list a second time as in the
  current code.

See also my comments to commit 10.

However, I believe that the O(N^2) worst case is unlikely to bite in
most cases. This function is called from builtin/fetch.c:get_ref_map(),
and the input list is the concatenation of several sub-lists. Most
sub-lists come from resolving refspecs, and the last comes from
find_non_local_tags(). I believe that each of the sub-lists is sorted.
So the O(N^2) worst case could only occur if more than one of the
sub-lists is large and at least two of the large sub-lists are not in
the correct order relative to each other.

I tried to concoct such a test case by creating a repository with
about 20k branches and 20k tags, instrumenting ref_remove_duplicates(),
and then fetching with something like

    git fetch --prune origin refs/heads/*:refs/remotes/origin/* \
        refs/tags/*:refs/tags/*

Indeed, the running time of ref_remove_duplicates() became significant
when the order of the refspecs was reversed, but it was still less
than a second out of a much longer command invocation. So I didn't
bother fixing the O(N^2) algorithm. If somebody reports real-world
slowness here, we can always come back to it.
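For illustration only (this is not part of the patch, and the names
are made up): below is a minimal, self-contained sketch of the third
approach. It sorts an auxiliary array of (node, original position)
pairs so that plain qsort() behaves stably, marks every occurrence of
a name except the earliest one, and then rebuilds the list in its
original order. Everything git-specific is left out: the "tracks both"
die() check, transferring/freeing peer_ref, and the real struct ref
layout ("struct sketch_ref" and "peer_name" are hypothetical
stand-ins).

    #include <stdlib.h>
    #include <string.h>

    struct sketch_ref {
    	struct sketch_ref *next;
    	const char *peer_name;	/* stand-in for ref->peer_ref->name */
    	int dup;		/* set for entries to be dropped */
    };

    struct sort_entry {
    	struct sketch_ref *ref;
    	size_t order;		/* original position; makes qsort() stable */
    };

    static int cmp_entry(const void *a_, const void *b_)
    {
    	const struct sort_entry *a = a_, *b = b_;
    	int cmp = strcmp(a->ref->peer_name, b->ref->peer_name);
    	if (cmp)
    		return cmp;
    	/* tie-break on original position to get a stable order */
    	return a->order < b->order ? -1 : a->order > b->order;
    }

    static void dedup_by_peer_name(struct sketch_ref *head)
    {
    	size_t n = 0, i;
    	struct sketch_ref *r, **tail;
    	struct sort_entry *entries;

    	for (r = head; r; r = r->next)
    		n++;
    	if (n < 2)
    		return;

    	entries = malloc(n * sizeof(*entries));	/* error check omitted */
    	for (r = head, i = 0; r; r = r->next, i++) {
    		entries[i].ref = r;
    		entries[i].order = i;
    		r->dup = 0;
    	}
    	qsort(entries, n, sizeof(*entries), cmp_entry);

    	/*
    	 * Within each run of equal names, the first entry is the
    	 * earliest occurrence (thanks to the order tie-break), so
    	 * mark all of the others as duplicates.
    	 */
    	for (i = 1; i < n; i++)
    		if (!strcmp(entries[i].ref->peer_name,
    			    entries[i - 1].ref->peer_name))
    			entries[i].ref->dup = 1;
    	free(entries);

    	/*
    	 * Relink the survivors in their original order.  The head is
    	 * always the earliest occurrence of its own name, so it can
    	 * never be marked dup.  Dropped nodes are only unlinked here;
    	 * real code would also free them.
    	 */
    	tail = &head->next;
    	for (r = head->next; r; r = r->next)
    		if (!r->dup) {
    			*tail = r;
    			tail = &r->next;
    		}
    	*tail = NULL;
    }

Counting, marking, and relinking are all linear passes, so the sort
dominates at O(N log N); any of the three approaches above would avoid
the quadratic behavior in the same way.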
 remote.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/remote.c b/remote.c
index e9fedfa..a44e897 100644
--- a/remote.c
+++ b/remote.c
@@ -750,13 +750,15 @@ void ref_remove_duplicates(struct ref *ref_map)
 	struct string_list refs = STRING_LIST_INIT_NODUP;
 	struct string_list_item *item = NULL;
 	struct ref *prev = NULL, *next = NULL;
+
 	for (; ref_map; prev = ref_map, ref_map = next) {
 		next = ref_map->next;
 		if (!ref_map->peer_ref)
 			continue;
 
-		item = string_list_lookup(&refs, ref_map->peer_ref->name);
-		if (item) {
+		item = string_list_insert(&refs, ref_map->peer_ref->name);
+		if (item->util) {
+			/* Entry already existed */
 			if (strcmp(((struct ref *)item->util)->name,
 				   ref_map->name))
 				die("%s tracks both %s and %s",
@@ -767,11 +769,9 @@ void ref_remove_duplicates(struct ref *ref_map)
 			free(ref_map->peer_ref);
 			free(ref_map);
 			ref_map = prev; /* skip this; we freed it */
-			continue;
+		} else {
+			item->util = ref_map;
 		}
-
-		item = string_list_insert(&refs, ref_map->peer_ref->name);
-		item->util = ref_map;
 	}
 	string_list_clear(&refs, 0);
 }
-- 
1.8.4