[PATCH] diffcore-rename: cache file deltas

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We find rename candidates by computing a fingerprint hash of
each file, and then comparing those fingerprints. There are
inherently O(n^2) comparisons, so it pays in CPU time to
hoist the (rather expensive) computation of the fingerprint
out of that loop (or to cache it once we have computed it once).

Previously, we didn't keep the filespec information around
because then we had the potential to consume a great deal of
memory. However, instead of keeping all of the filespec
data, we can instead just keep the fingerprint.

This patch implements and uses diff_free_filespec_data_large
to accomplish that goal. We also have to change
estimate_similarity not to needlessly repopulate the
filespec data when we already have the hash.

Practical tests showed 4.5x speedup for a 10% memory usage
increase.

Signed-off-by: Jeff King <peff@xxxxxxxx>
---
The implementation is a little less nice than I would like, but I was
trying to be non-invasive. Specifically:

 - the name diff_free_filespec_data_large is horrible, but this is based
   on the fact that diff_free_filespec_data actually does too much (it
   frees the data _and_ some other auxiliary data). And renaming that
   would entail changing many callsites.

 - It seems that a better place to call diffcore_populate_filespec
   (rather than in estimate_similarity) would actually be in
   diffcore_count_changes (when we _know_ that we need to populate the
   contents data).

 - The hash_chars() should arguably be tied into
   diffcore_populate_filespec, which should have more of a "what
   information do you want" interface. I.e., the "size_only" parameter
   could be extended to a bitfield to say "populate this, and I need the
   delta fingerprint, size, actual contents, etc". Then callers could
   just use "populate" before looking at the filespec and it would
   lazily load whatever they needed.

This patch cuts my pathological case from 20 minutes to 2 minutes, which
is a great improvement, but still unusable. However, now I should be
able to get more useful numbers on what else can be sped up.

 diff.c            |    7 ++++++-
 diffcore-rename.c |    7 ++++---
 diffcore.h        |    1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/diff.c b/diff.c
index 2216d75..647bcf0 100644
--- a/diff.c
+++ b/diff.c
@@ -1654,7 +1654,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 	return 0;
 }
 
-void diff_free_filespec_data(struct diff_filespec *s)
+void diff_free_filespec_data_large(struct diff_filespec *s)
 {
 	if (s->should_free)
 		free(s->data);
@@ -1665,6 +1665,11 @@ void diff_free_filespec_data(struct diff_filespec *s)
 		s->should_free = s->should_munmap = 0;
 		s->data = NULL;
 	}
+}
+
+void diff_free_filespec_data(struct diff_filespec *s)
+{
+	diff_free_filespec_data_large(s);
 	free(s->cnt_data);
 	s->cnt_data = NULL;
 }
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41b35c3..4fc2000 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -184,7 +184,8 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
+	if ((!src->cnt_data && diff_populate_filespec(src, 0))
+		|| (!dst->cnt_data && diff_populate_filespec(dst, 0)))
 		return 0; /* error but caught downstream */
 
 
@@ -377,10 +378,10 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
+			diff_free_filespec_data_large(one);
 		}
 		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
+		diff_free_filespec_data_large(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */
diff --git a/diffcore.h b/diffcore.h
index eef17c4..4bf175b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -48,6 +48,7 @@ extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 
 extern int diff_populate_filespec(struct diff_filespec *, int);
 extern void diff_free_filespec_data(struct diff_filespec *);
+extern void diff_free_filespec_data_large(struct diff_filespec *);
 extern int diff_filespec_is_binary(struct diff_filespec *);
 
 struct diff_filepair {
-- 
1.5.3.2.1061.gc056e-dirty
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux