Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 03 Feb 2021 13:56:21 -0800

"Elijah Newren via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> This series depends on en/merge-ort-perf and makes full use of exact
> renames; see commit messages for details.
>
> Thanks to Stolee and Junio for reviewing v1.
>
> Changes since v1:
>
>  * Update rename_src_nr when updating rename_src
>  * Introduce want_copies in the first patch and use it in a few more places
>  * Move a comment below a few exit-early if-checks.
>
> Elijah Newren (2):
>   diffcore-rename: no point trying to find a match better than exact
>   diffcore-rename: filter rename_src list when possible
>
>  diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 61 insertions(+), 8 deletions(-)

Thanks, these look better.

With these changes, I guess only two things remain in the rename
machinery, unchanged since I invented it, that I find somewhat
embarrassing.

 - We still need to go full matrix while finding the "best"
   pairing.  I cannot think of a way to avoid it (that is what makes
   it embarrassing) but wish there were some way to.

   In an early attempt, I tried to retire rename_src[j], once
   rename_dst[i] has been found to be a "good enough" match for it,
   from the pool of rename src candidates to find a good match for
   rename_dst[k] for i < k, but a naive implementation of it would not
   work well for obvious reasons---rename_src[j] may match a lot
   better with rename_dst[k] than rename_dst[i] but we do not know
   that until we try to estimate similarity with rename_dst[k].

 - The .cnt_data member was designed to be a concise summary of the
   blob characteristics so that two .cnt_data can be "compared"
   fairly cheaply to see how "similar" two blobs are [*], but (1) it
   is rather big to be called a "concise summary", and (2) it was
   not chosen after real performance measurement, and we've been
   using it for the past 15 years without revisiting its design.

   Side note: In a very early prototype, the approach to assess
   similarity between two blobs was very different---there was no
   attempt to compute a "concise summary" for each blob, but we just
   attempted to create a delta (as in the pack data) between src and
   dst blobs and measured how small a delta we can use to transform
   from src to dst.
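
To make the first point above concrete, here is a toy sketch of why
retiring rename_src[j] early goes wrong.  This is not the code in
diffcore-rename.c; the score table, the MIN_SCORE threshold, and the
function names match_greedy() and match_full_matrix() are all made up
for illustration.

```c
#define NUM_DST 2
#define NUM_SRC 2
#define MIN_SCORE 50	/* hypothetical "good enough" threshold */

/* Made-up similarity scores (0-100): score[i][j] compares dst i with src j. */
static const int score[NUM_DST][NUM_SRC] = {
	{ 60,  0 },	/* dst 0: src 0 is merely "good enough" */
	{ 95,  0 },	/* dst 1: src 0 is a much better match */
};

/*
 * Naive approach: walk dst in order, take the first src that is
 * "good enough", and retire it right away.  match[i] receives the
 * src index chosen for dst i, or -1 if none was found.
 */
static void match_greedy(int match[NUM_DST])
{
	int retired[NUM_SRC] = { 0 };
	int i, j;

	for (i = 0; i < NUM_DST; i++) {
		match[i] = -1;
		for (j = 0; j < NUM_SRC; j++) {
			if (retired[j] || score[i][j] < MIN_SCORE)
				continue;
			match[i] = j;
			retired[j] = 1;
			break;
		}
	}
}

/*
 * Full matrix: inspect every (dst, src) cell and repeatedly take the
 * highest-scoring pair whose dst and src are both still unused.
 */
static void match_full_matrix(int match[NUM_DST])
{
	int used[NUM_SRC] = { 0 };
	int i, j, n;

	for (i = 0; i < NUM_DST; i++)
		match[i] = -1;

	for (n = 0; n < NUM_DST; n++) {
		int best = MIN_SCORE - 1, bi = -1, bj = -1;

		for (i = 0; i < NUM_DST; i++) {
			for (j = 0; j < NUM_SRC; j++) {
				if (match[i] != -1 || used[j])
					continue;
				if (score[i][j] > best) {
					best = score[i][j];
					bi = i;
					bj = j;
				}
			}
		}
		if (bi < 0)
			break;	/* nothing left above the threshold */
		match[bi] = bj;
		used[bj] = 1;
	}
}
```

With this table, match_greedy() hands src 0 to dst 0 at score 60 and
leaves dst 1 unmatched, while match_full_matrix() pairs src 0 with
dst 1 at score 95.  The greedy version cannot know about the 95
until it has already given src 0 away.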