Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx> writes:

> Junio C Hamano <gitster@xxxxxxxxx> writes:
>> Matthieu Moy <Matthieu.Moy@xxxxxxx> writes:
>
>> Explanation of '<m>' might want to clarify why it counts only the deletion
>> and to mention that "100-similarity != dissimilarity", but as the end-user
>> level documentation, these probably are unnecessary.
>
> The thing is: I don't know the answer myself, so I'm not in a position
> to write such documentation :-(.
> ...
> Likewise, I didn't write "lines" as a white lie, but because of my
> ignorance ... hence my request for help.

Sorry, but I actually do not have much more to say than what eeaa460
(diff: Update -B heuristics., 2005-06-03) says.

When breaking for the purpose of showing a patch as "total rewrite",
what matters is how little of the original contents remains in the
result. Imagine that you start from a 100-line document and remove 97
lines from it. You then add 27 lines of new material to make a 30-line
document, or add 997 lines to make a 1000-line document. Either way you
rewrote the document, and how dissimilar the result is relative to the
original is the same in both cases. N.B. this is only true as long as
there is enough new material in the result; removing 97% without adding
anything is not a rewrite.

This 97% is "how much did we discard from the original", and it is the
number you would see as the "dissimilarity index" ('m' in '-Bn/m').

When breaking, tentatively, for the purpose of rename detection, the
amount of new material starts to matter more. The reason we try to see
if we want to break the pair is exactly that we hope to find something
similar to the new material in a blob that used to be at another path in
the preimage but has disappeared from it.
So we count both deletion and addition to see if the pair has a lot of
changes ('n' in '-Bn/m'), which is similar to the way the "similarity
index" used in the "rename" codepath is computed, to decide if we want
to tentatively break the pair.

Halves of a pair that is tentatively broken, when they do not have a
matching rename, are merged back together if they were not a total
rewrite (i.e. the dissimilarity index for the pair is lower than the
threshold 'm').

In either case, the algorithm to compute how much "stuff" was copied
from the original and how much "stuff" was added anew to the result is
not based on "lines", but on "bytes".
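
To make the two numbers concrete, here is a rough sketch in Python of
the arithmetic described above. It is not the code in diffcore-break;
the function names and the exact normalization (dividing the edit
amount by the larger of the two sizes) are assumptions made for
illustration. The sizes are in arbitrary units, whereas the real
computation counts bytes.

```python
def dissimilarity(src_size, src_copied):
    """Fraction of the original that was discarded.

    Illustrates the 'm' side of -Bn/m: only deletion counts.
    (Hypothetical helper, not a git function.)
    """
    removed = src_size - src_copied
    return removed / src_size


def edit_extent(src_size, dst_size, src_copied):
    """Deletion plus addition, relative to the larger side.

    Illustrates the 'n' side of -Bn/m: both kinds of change count.
    The normalization by max() is an assumption of this sketch.
    """
    removed = src_size - src_copied
    added = dst_size - src_copied
    return (removed + added) / max(src_size, dst_size)


# The example from the message: 100 units, of which only 3 survive.
small = (100, 30, 3)      # 27 units of new material added
large = (100, 1000, 3)    # 997 units of new material added

for src, dst, copied in (small, large):
    print(dst, dissimilarity(src, copied), edit_extent(src, dst, copied))
```

The dissimilarity comes out the same for both results, because it looks
only at what was discarded from the original, while the edit extent
differs with the amount of new material.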