Re: [RFC PATCH 0/1] Fuzzy blame

Michael Platings <michael@xxxxxxxxx> · Mon, 25 Mar 2019 23:21:19 +0000

Hi Barret,

> I work on a project that needs a major reformatting, and one thing
> delaying me was the lack of an ability to ignore commits during blame.

I think we understand each other well then - I'm working on a plan to
change the variable naming rule in LLVM, and naturally other
developers aren't keen on making git blame less useful.

> One question I have is under which circumstances do you find that you
> cannot match a suspect/child line to a parent?  One obvious case is a
> commit that only adds lines - the parent's line set is the empty set.  I
> think you catch that earlier in your code (parent_chunk_length == 0),
> though I guess there are other cases where the parent has lines, but
> they are below a certain match threshold?

Yes, exactly. The threshold is currently 0 i.e. a single matching
bigram is all that's required for two lines to be considered matching,
but in future the threshold could be configurable in the same manner
as -M & -C options.
In the t8020-blame-fuzzy.sh test script in my patch, where it's
expected that a line will be attributed to the "ignored" commit you'll
see "Final". So far this is just "}" lines.

> Another question was whether or not you wanted this to apply per commit
> or for an entire blame session.  It looks like your current code applies
> it overall, and not for a specific commit.

This is a really interesting question for this feature. Initially I
just wanted to be able to say "Hey, Git, ignore this revision please."
But then Git says "OK, but how exactly? I can ignore whitespace and I
can detect moves & copies so do you want me to do those?" And then I'm
thinking, actually yes -M10 would be great because I know that this
revision also reordered a bunch of #includes and I still want people
to be able to see where they came from. However other sets of options
might work better for other changes.

On looking at the problem this way it seems that fuzzy matching
belongs in the same class as -w, -M & -C. As these options apply for
an entire blame session, it would be consistent to allow applying the
fuzzy matching likewise. As a bonus, having the ability to apply the
-F option for the entire blame session seems quite nice for other use
cases.

> I'd much prefer it to be per-commit

Yes, we definitely need a way to say "fuzzy match this commit only"
otherwise you lose the ability to detect small but significant changes
in other commits.
I haven't explored this fully, but I'm thinking that the revision
options file might look something like this:

# Just use the defaults, whatever they may be
6e8063eee1d30bc80c7802e94ed0caa8949c6323
# This commit changed loads of tabs to spaces
35ee755a8c43bcb3c2786522d423f006c23d32df -w
# This commit reordered lots of short lines
c5b679e14b761a7bfd6ae93cfffbf66e3c4e25a5 -M5
# This commit copied some stuff and changed CamelCase to snake_case
58b9cd43da695ee339b7679cf0c9f31e1f8ef67f -w -C15 -F

For the command-line, potentially we could make -w/-M/-C/-F specified
after --ignore-rev apply to only that revision e.g.:
git blame myfile --ignore-rev 35ee755a8c43bcb3c2786522d423f006c23d32df -M -F

But as I say, I haven't explored this fully.

> For those cases where you can't find a match, I could imagine marking
> them as unblamable.  The purpose of 'unblamable' in my patchset was to
> signal to not bother looking up further commit info for a final blame
> entry.  It was largely so that the user (me!) wouldn't see a commit
> blamed when I explicitly told git to not tell me about that commit.

I can see how the unblameable concept makes sense for the heuristic
you have right now. However, once you have a heuristic that can tell
you with a high degree of certainty that a line really did come from a
commit that you're merely not interested in, then I suggest that it's
better to just point at that commit.

> Both approaches sound fine though.

:)

> The first thing that comes to mind for me is to plug your fuzzy logic
> into my patch set.

Please do! It should be easy to pluck fuzzy_find_matching_lines() and
its dependencies out. Just to set your expectations, I have not yet
optimised it and it is highly wasteful right now both in terms of time
and memory.

> But maybe we need more info for your function.

The extra info needed is the parent & child file content, plus the
indices in the strings of where new lines start. This information is
already calculated in the course of doing the diff but it looks like a
fair amount of plumbing work will be needed to make the information
available to the chunk callback. That was my reason for initially
plonking the fuzzy matching in a separate pass.

Thanks,
-Michael