Re: [PATCH v6 0/6] blame: add the ability to ignore commits

Barret Rhoden <brho@xxxxxxxxxx> · Mon, 15 Apr 2019 09:23:37 -0400

Hi Michael -

On 4/14/19 5:10 PM, Michael Platings wrote:
Hi Barret,

This works pretty well for the typical reformatting use case now. I've
run it over every commit of every .c file in the git project root,
both forwards and backwards with every combination of -w/-M/-C and
can't get it to crash so I think it's good in that respect.

However, it can still attribute lines to the wrong parent line. See
https://pypi.org/project/autopep8/#usage for an example reformatting
that it gets a bit confused on. The patch I submitted handles this
case correctly because it uses information about the more similar
lines to decide how more ambiguous lines should be matched.

Yeah - I ran your tests against it and noticed a few cases weren't handled.

You also gave an example of:

         commit-a 11) void new_func_1(void *x, void *y);
         commit-b 12) void new_func_2(void *x, void *y);

Being reformatted to:

         commit-a 11) void new_func_1(void *x,
         commit-b 12)                 void *y);
         commit-b 13) void new_func_2(void *x,
         commit-b 14)                 void *y);

The patch I submitted handles this case correctly, assigning line 12
to commit-a because it scales the parent line numbers according to the
relative diff chunk sizes instead of assuming a 1-1 mapping.

So I do ask that you incorporate more of my patch, including the test
code. It is more complex but I hope this demonstrates that there are
reasons for that. Happy to provide more examples or explanation if it
would help. On the other hand if you have examples where it falls
short then I'd be interested to know.

My main concerns:
- Can your version reach outside of a diff chunk?  such as in my "header 
moved" case.  That was a simplified version of something that pops up in 
a major file reformatting of mine, where a "return 0;" was matched as 
context and broke a diff chunk up into two blame_chunk() calls.  I tend 
to think of this as the "split diff chunk."

- Complexity and possibly performance.  The recursive stuff made me 
wonder about it a bit.  It's no reason not to use it, just need to check 
it more closely.

Is the latest version of your stuff still the one you posted last week 
or so?  If we had a patch applied onto this one with something like an 
ifdef or a dirt-simple toggle, we can play with both of them in the same 
codebase.

Similarly, do you think the "two pass" approach I have (check the chunk, 
then check the parent file) would work with your recursive partitioning 
style?  That might make yours able to handle the "split diff chunk" case.

The other major use case that I'm interested in is renaming. In this
case, the git-hyper-blame approach of mapping line numbers 1-1 works
perfectly. Here's an example. Before:

         commit-a 11) Position MyClass::location(Offset O) {
         commit-b 12)    return P + O;
         commit-c 13) }

After:

         commit-a 11) Position MyClass::location(Offset offset) {
         commit-a 12)    return position + offset;
         commit-c 13) }

With the fuzzy matching, line 12 gets incorrectly matched to parent
line 11 because the similarity of "position" and "offset" outweighs
the similarity of "return". I'm considering adding even more
complexity to my patch such that parts of a line that have already
been matched can't be matched again by other lines.

But the other possibility is that we let the user choose the
heuristic. For a commit where they know that line numbers haven't
changed they could choose 1-1 matching, while for a reformatting
commit they could use fuzzy matching. I welcome your thoughts.

No algorithm will work for all cases.  The one you just gave had the 
simple heuristic working better than a complex one.  We could make it 
more complex, but then another example may be worse.  I can live with 
some inaccuracy in exchange for simplicity.

I ran into something similar with the THRESHOLD #defines.  You want it 
to be able to match certain things, but not other things.  How similar 
does something have to be?  Should it depend on how far away the 
matching line is from the source line?  I went with a "close enough is 
good enough" approach, since we're marking with a '*' or something 
anyways, so the user should know to not trust it 100%.

Thanks,

Barret