Re: [PATCH 2/2] diffcore-break: save cnt_data for other phases

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 16 Nov 2009 13:20:00 -0800

Jeff King <peff@xxxxxxxx> writes:

> The "break" phase works by counting changes between two
> blobs with the same path. We do this by splitting the file
> into chunks (or lines for text oriented files) and then
> keeping a count of chunk hashes.
>
> The "rename" phase counts changes between blobs at two
> different paths. However, it uses the exact same set of
> chunk hashes (which are immutable for a given sha1).
>
> The rename phase can therefore use the same hash data as
> break. Unfortunately, we were throwing this data away after
> computing it in the break phase. This patch instead attaches
> it to the filespec and lets it live through the rename
> phase, working under the assumption that most of the time
> that breaks are being computed, renames will be too.

The change looks correct, but I initially got worried about your change
interacts badly with this part in estimate_similarity() around line 176:

	if (!src->cnt_data && diff_populate_filespec(src, 0))
		return 0;
	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
		return 0;

I think the reason why it (not your patch but the original code) felt a
bit brittle is because the above if() statements know too much about the
internal of d-c-c (namely, it never looks at src->data when src->cnt_data
is already available, so it is safe to leave src->data NULL).

The same logic suggests that the loop to build the matrix in
diffcore_rename() could free two->data at the end of the innermost loop.

We currently loop this way (around ll. 505-530):

	for each two (i.e. rename destination candidate):
        	for each one (i.e. rename source candidate):
                	compute and record how similar one and two are
			free one->data
		free two->data

After computing how similar "one" and something else is only once, we have
one->cnt_data so we won't need one->data (the same fact is exploited by
your patch for optimization), and that is why freeing one->data in the
innermost loop does not result in constant re-reading of the same blob
data when we iterate more than one rename destination.  But the same logic
applies to "two" and we should be able to free the blob data early to
reduce the memory pressure.

I dunno if it would give us measurable performance benefit, though.

Here is the idea on top of your patch, but I think it can be applied
independently.

 diffcore-rename.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 63ac998..875ca81 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -523,10 +523,13 @@ void diffcore_rename(struct diff_options *options)
 			this_src.dst = i;
 			this_src.src = j;
 			record_if_better(m, &this_src);
+			/*
+			 * Once we run estimate_similarity, 
+			 * We do not need the text anymore.
+			 */
 			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
 		}
-		/* We do not need the text anymore */
-		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}
 
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html