Jeff King <peff@xxxxxxxx> writes: > The "break" phase works by counting changes between two > blobs with the same path. We do this by splitting the file > into chunks (or lines for text oriented files) and then > keeping a count of chunk hashes. > > The "rename" phase counts changes between blobs at two > different paths. However, it uses the exact same set of > chunk hashes (which are immutable for a given sha1). > > The rename phase can therefore use the same hash data as > break. Unfortunately, we were throwing this data away after > computing it in the break phase. This patch instead attaches > it to the filespec and lets it live through the rename > phase, working under the assumption that most of the time > that breaks are being computed, renames will be too. The change looks correct, but I initially got worried about your change interacts badly with this part in estimate_similarity() around line 176: if (!src->cnt_data && diff_populate_filespec(src, 0)) return 0; if (!dst->cnt_data && diff_populate_filespec(dst, 0)) return 0; I think the reason why it (not your patch but the original code) felt a bit brittle is because the above if() statements know too much about the internal of d-c-c (namely, it never looks at src->data when src->cnt_data is already available, so it is safe to leave src->data NULL). The same logic suggests that the loop to build the matrix in diffcore_rename() could free two->data at the end of the innermost loop. We currently loop this way (around ll. 505-530): for each two (i.e. rename destination candidate): for each one (i.e. rename source candidate): compute and record how similar one and two are free one->data free two->data After computing how similar "one" and something else is only once, we have one->cnt_data so we won't need one->data (the same fact is exploited by your patch for optimization), and that is why freeing one->data in the innermost loop does not result in constant re-reading of the same blob data when we iterate more than one rename destination. But the same logic applies to "two" and we should be able to free the blob data early to reduce the memory pressure. I dunno if it would give us measurable performance benefit, though. Here is the idea on top of your patch, but I think it can be applied independently. diffcore-rename.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 63ac998..875ca81 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -523,10 +523,13 @@ void diffcore_rename(struct diff_options *options) this_src.dst = i; this_src.src = j; record_if_better(m, &this_src); + /* + * Once we run estimate_similarity, + * We do not need the text anymore. + */ diff_free_filespec_blob(one); + diff_free_filespec_blob(two); } - /* We do not need the text anymore */ - diff_free_filespec_blob(two); dst_cnt++; } -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html