On Thu, Sep 13, 2007 at 10:29:49PM -0700, Linus Torvalds wrote:

> Yeah, we should probably:
>  - default to something larger but still reasonably sane (ie not 100, but
>    perhaps 1000)
>  - special-case the "identical rename", and have a higher limit for that
>    (we already handle the identicals as a separate pass before we even
>    start doing the similarity analysis - and it's the similarity analysis
>    that can be the really expensive part)

FWIW, my data set is unusably slow at 641 renames (267 of which
actually require similarity analysis).

The actual performance of similarity analysis is going to vary with the
size of the files, but you're always going to have some O(n^2) behavior
on that number. So I don't know if it's worth noting how large the
files are and making a cutoff there.

...

Hmm. Actually, doing some profiling, it looks like about 75% of the
time is spent not in the O(n^2) comparison, but in generating the hash
fingerprints of each file.

There seems to be a serious performance problem in diffcore-rename.
There is infrastructure to cache the "cnt_data" member of each
filespec, but it never gets used because we immediately free the
filespec data after use. Oops.

With this patch:

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6bde439..531a844 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -362,10 +362,7 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
 		}
-		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */

My 20-minute diff becomes a 2-minute diff.

The downside is that the memory usage is much increased (for obvious
reasons: it should grow by roughly the size of the data set, since we
are keeping pointers to the data around -- in my case, around 1G
extra). However, keeping around _just_ the cnt_data caused only about
100M of extra memory consumption (and gave the same performance boost).

Of course, 2 minutes for a git-status is still way too slow, so we
might still need a limiter there.

It also looks like 57% of my time is spent in spanhash_find, and
another 29% in diffcore_count_changes.

The spanhash data structure is a bit confusing. At first, it looked
like we were doing a linear search for a matching hash, but it's not
quite that, since we seem to start at some magic spot based on the
hashval we're looking up. But it seems to be an array of (hash, count)
pairs for each file.

Is there a reason not to use a hash table mapping hash -> count? That
would make insertion and lookup O(1), presumably at the cost of a bit
more memory (since each filespec would have the full table).

Junio, can you shed some light on that decision?

-Peff
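
To make the memory trade-off concrete, here is a rough sketch of the
"keep only the cnt_data" variant mentioned above: once a file's
fingerprint has been computed, release the large blob contents but
leave the small cached fingerprint in place for later comparisons.
Apart from the "data"/"cnt_data" field names, the struct and helper
below are made up for illustration and are not the real diff API:

#include <stdlib.h>

struct filespec_sketch {
	void *data;		/* blob contents -- the ~1G part */
	unsigned long size;
	void *cnt_data;		/* cached hash fingerprint -- the ~100M part */
};

static void drop_blob_keep_fingerprint(struct filespec_sketch *spec)
{
	free(spec->data);	/* the file contents are no longer needed ... */
	spec->data = NULL;
	spec->size = 0;
	/* ... but deliberately leave cnt_data cached for reuse */
}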
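
And, for the spanhash question, a minimal sketch of the kind of
open-addressed (hash, count) table described above: lookup starts at a
slot derived from the hash value and probes forward, which is why it is
"not quite" a linear search. The names are invented for the sketch and
are not the identifiers git actually uses:

struct span_slot {
	unsigned int hashval;	/* 0 means "empty slot" in this sketch */
	unsigned int count;
};

struct span_table {
	unsigned int nslots;	/* assumed to be a power of two */
	struct span_slot *slot;
};

static unsigned int span_lookup(const struct span_table *t,
				unsigned int hashval)
{
	unsigned int i = hashval & (t->nslots - 1);	/* the "magic spot" */

	for (;;) {
		const struct span_slot *s = &t->slot[i];
		if (!s->hashval)
			return 0;			/* not present */
		if (s->hashval == hashval)
			return s->count;		/* found */
		i = (i + 1) & (t->nslots - 1);		/* probe next slot, wrapping */
	}
}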