Re: [RFC/PATCH] Document -B<n>[/<m>], -M<n> and -C<n> variants of -B, -M and -C

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 30 Jul 2010 09:42:01 -0700

Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx> writes:

> Junio C Hamano <gitster@xxxxxxxxx> writes:
>> Matthieu Moy <Matthieu.Moy@xxxxxxx> writes:
>
>> Explanation of '<m>' might want to clarify why it counts only the deletion
>> and to mention that "100-similarity != dissimilarity", but as the end-user
>> level documentation, these probably are unnecessary.
>
> The thing is: I don't know the answer myself, so I'm not in a position
> to write such documentation :-(.
> ...
> Likewise, I didn't write "lines" as a white lie, but because of my
> ignorance ... hence my request for help.

Sorry, but I actually do not have much more to say than what eeaa460
(diff: Update -B heuristics., 2005-06-03) says.

When breaking for the purpose of showing a patch as "total rewrite", what
matters is how little of the original contents remains in the result.
Imagine that you start from a 100-line document and remove 97 lines from
it.  You then add 27 lines of new material to make a 30-line document, or
add 997 lines to make a 1000-line document.  Either way you rewrote the
document, and how dissimilar the result is relative to the original is
the same in both cases.  N.B. this is only true as long as there is
enough new material in the result; removing 97% without adding anything
is not a rewrite.  This 97% is "how much did we discard from the
original", and it is the number you would see as the "dissimilarity
index" (the value compared against 'm' in '-Bn/m').
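
To make the arithmetic above concrete, here is a minimal sketch in Python.
The function name and its parameters are made up for illustration and are
not taken from git's source; sizes are abstract "units".

```python
def discarded_percent(original_size, retained, added):
    """Percentage of the original that did not survive into the result.

    'added' is accepted only to show that it does not enter the formula.
    All names here are hypothetical, not git's own.
    """
    if original_size <= 0:
        raise ValueError("original must not be empty")
    if not 0 <= retained <= original_size:
        raise ValueError("retained must be between 0 and original_size")
    removed = original_size - retained
    return removed * 100 // original_size


# 100 units originally, 3 kept, 27 added: a 30-unit result
small = discarded_percent(100, retained=3, added=27)
# 100 units originally, 3 kept, 997 added: a 1000-unit result
large = discarded_percent(100, retained=3, added=997)
print(small, large)  # 97 97
```

The "enough new material" condition from the N.B. is deliberately left out
of the sketch, because how much counts as enough is not spelled out here.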

When breaking, tentatively, for the purpose of rename detection, the
amount of new material starts mattering more.  The reason we try to see
if we want to break the pair is exactly that we hope to find something
similar to the new material in a blob that used to be at another path in
the preimage but has since disappeared.  So we count both deletion and
addition to see if the pair has a lot of changes ('n' in '-Bn/m'), which
is similar to the way the "similarity index" used in the "rename"
codepath is computed, to decide if we want to tentatively break the
pair.  Halves of a tentatively broken pair that do not find a matching
rename are merged back together if the pair was not a total rewrite
(i.e. the dissimilarity index for the pair is lower than the threshold
'm').
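
The two-step decision described above can be sketched roughly as follows,
in Python.  This is an illustration only: the function names, the choice
of the larger of the two sizes as the denominator, and the exact
comparison operators are assumptions, not a transcription of what git
does.

```python
def change_percent(original_size, retained, added):
    """Extent of change, counting both deletion and addition.

    Assumption: measured against the larger of the two sizes, so the
    value can exceed 100 when a lot is both removed and added.
    """
    removed = original_size - retained
    result_size = retained + added
    return (removed + added) * 100 // max(original_size, result_size)


def dissimilarity_percent(original_size, retained):
    """How much of the original was discarded (deletion only)."""
    return (original_size - retained) * 100 // original_size


def final_state(original_size, retained, added, n, m, found_rename):
    """Return 'kept', 'merged back' or 'broken' for one filepair."""
    if change_percent(original_size, retained, added) < n:
        return "kept"         # not enough change to break tentatively
    if found_rename:
        return "broken"       # one half matched a rename elsewhere
    if dissimilarity_percent(original_size, retained) < m:
        return "merged back"  # no rename, and not a total rewrite
    return "broken"           # shown as a total rewrite


# Illustrative thresholds n=50, m=60 (hypothetical values).
print(final_state(100, 60, 40, n=50, m=60, found_rename=False))
print(final_state(100, 3, 27, n=50, m=60, found_rename=False))
```

With these inputs the first pair is broken tentatively (80% change) but
merged back because only 40% of the original was discarded, while the
second stays broken because 97% of it was.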

In either case, the algorithm to compute how much "stuff" was copied from
the original and how much "stuff" was added anew to the result is not
based on "lines", but based on "bytes".
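
To show why the unit matters, here is a small Python sketch that measures
the same one-word edit in bytes and in lines.  Note that difflib is used
purely as a stand-in to estimate how many bytes were copied; it is not the
algorithm git uses, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def discarded_percent_bytes(original: bytes, result: bytes) -> int:
    """Share of the original's bytes that cannot be found in the result."""
    matcher = SequenceMatcher(None, original, result, autojunk=False)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return (len(original) - copied) * 100 // len(original)


def discarded_percent_lines(original: bytes, result: bytes) -> int:
    """Share of the original's lines that do not appear in the result."""
    old_lines = original.splitlines()
    new_lines = set(result.splitlines())
    gone = sum(1 for line in old_lines if line not in new_lines)
    return gone * 100 // len(old_lines)


old = b"The quick brown fox jumps over the lazy dog\n"
new = b"The quick brown cat jumps over the lazy dog\n"

print(discarded_percent_bytes(old, new))  # 6
print(discarded_percent_lines(old, new))  # 100
```

Counting by lines, the only line in the file is gone; counting by bytes,
only 3 of its 44 bytes were replaced.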
