Hi Philip, On Wed, 9 Oct 2019, Philip Oakley wrote: > On 08/10/2019 13:46, Johannes Schindelin wrote: > > Hi Junio, > > > > On Tue, 8 Oct 2019, Junio C Hamano wrote: > > > > > Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes: > > > > > > > > I didn't quite understand this part, though. > > > > > > > > > > The default creation factor is 60 (roughly speaking, it wants 60% > > > > > of > > > > > the lines to match between two patches, otherwise it considers > > > > > the > > > > > patches to be unrelated). > > > > > > > > > > Would the updated creation factor used which is 95 (roughly > > > > > speaking) want 95% of the lines to match between two patches? > > > > > > > > > > That would make the matching logic even pickier and reject more > > > > > paring, so I must be reading the statement wrong X-<. > > > > No, I must have written the opposite of what I tried to say, is all. > > > So, cfactor of 60 means at most 60% is allowed to differ and the > > > two patches are still considered to be related, while 95 means only > > > 5% needs to be common? That would make more sense to me. > > Okay, I not only wrote the opposite of what I wanted to say, I also > > misremembered. > > > > When `range-diff` tries to determine matching pairs of patches, it > > builds an `(m+n)x(m+n)` cost matrix, where `m` is the number of patches > > in the first commit range and `n` is the number of patches in the second > > one. > > > > Why not `m x n`? Well, that's the obvious matrix, and that's what it > > starts with, essentially assigning the number of lines of the diff > > between the diffs as "cost". > > > > But then `git range-diff` extends the cost matrix to allow for _all_ of > > the `m` patches to be considered deleted, and _all_ of the `n` patches > > to be added. As cost, it cannot use a "diff of diffs" because there is > > no second diff. So it uses the number of lines of the one diff it has, > > multiplied by the creation factor interpreted as a percentage. > > > > The naive creation factor would be 100%, which is (almost) as if we > > assumed an empty diff for the missing diff. But that would make the > > range-diff too eager to dismiss rewrites, as experience obviously showed > > (not my experience, but Thomas Rast's, who came up with `tbdiff` after > > all): the diff of diffs includes a diff header, for example. > > > > The interpretation I offered (although I inverted what I wanted to say) > > is similar in spirit to that metric (which is not actually a metric, I > > believe, because I expect it to violate the triangle inequality) is > > obviously inaccurate: the number of lines of the diff of diffs does not > > say anything about the number of matching lines, quite to the contrary, > > it correlates somewhat to the number of non-matching lines. > > > > So a better interpretation would have been: > > > > The default creation factor is 60 (roughly speaking, it wants at > > most 60% of the diffs' lines to differ, otherwise it considers > > them not to be a match. > > > > This is still inaccurate, but at least it gets the idea of the > > range-diff across. > > > > Of course, I will never be able to amend the commit message in > > GitGitGadget anyway, as I have merged that PR already. > > > > Ciao, > > Dscho > Medium term, is this something that could go in the algorithms section of the > range-diff man page, especially if the upstream commit message is already in > place. > > #leftoverdocs ? Sure. How about giving it a try while our memory is still fresh? You would help me immensely if you could take that task off of my plate... Thanks, Dscho