[RFC PATCH 0/1] Fuzzy blame

michael@xxxxxxxxx · Sun, 24 Mar 2019 23:50:19 +0000

From: Michael Platings <michael@xxxxxxxxx>

Hi Git devs,

Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
you have a commit that makes sweeping changes that are unlikely to be what you
are looking for in a blame, such as mass reformatting or renaming."

git-hyper-blame is useful, but (a) it's not convenient to install; (b) it's
missing functionality available in regular git blame; (c) its method of
matching lines between chunks is too simplistic for many use cases; and
(d) it's not Git, so it doesn't integrate well with tools that expect Git,
e.g. vim plugins. Therefore I'm hoping to add similar and hopefully superior
functionality to Git itself. I have a very rough patch so I'd like to get your
thoughts on the general approach, particularly in terms of its user-visible
behaviour.

My initial idea was to lift the design directly from git-hyper-blame. However,
the approach of picking single revisions to somehow ignore doesn't sit well
with the -w, -M & -C options, which have a similar intent but apply to all
revisions.

I'd like to get your thoughts on whether we could allow applying the -M or -w
options to specific revisions. For example, imagine it was agreed that all
the #includes in a project should be reordered. In that case, it would be useful
to be able to specify that the -M option should be used for blames on that
revision specifically, so that in future when someone wants to know why
a #include was added they don't have to run git blame twice to find out.

Options that are specific to a particular revision could be stored in a
".gitrevisions" file or similar.

If the principle of allowing blame options to be applied per-revision is
agreeable then I'd like to add a -F/--fuzzy option, to sit alongside -w, -M & -C.

I've implemented a prototype "fuzzy" option, patch attached.
The option operates at the level of diff chunks. For each line in the "after"
half of the chunk it uses a heuristic to choose which line in the "before" half
of the chunk matches best. The heuristic I'm using at the moment is to match
"bigrams" as described in [2]. The initial pass typically gives reasonable
results, but can jumble up the lines. Since the content should stay in the
same order in the reformatting/renaming use case, it's worth going to extra
effort to avoid jumbling lines. Therefore, after the initial pass, the line
that can be
matched with the most confidence is used to partition the chunk into halves
before and after it. The process is then repeated recursively on the halves
above and below the partition line.
I feel like a similar algorithm has probably already been invented in a better
form - if anyone knows of such a thing then please let me know!
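
To make the description above concrete, here is a minimal Python sketch of the
idea. It is not the code in the patch (which is C, in blame.c), and several
details are assumptions made for illustration: bigrams are taken over
characters with whitespace ignored, "confidence" is simply the best
Sørensen-Dice score, and both halves keep the matched "before" line so that a
line split in two can still map back to one original. The names bigrams, dice
and fuzzy_match are hypothetical.

```python
from collections import Counter


def bigrams(line):
    """Multiset of adjacent character pairs, with whitespace ignored (assumption)."""
    s = "".join(line.split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Sørensen-Dice coefficient of the two lines' bigram multisets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def fuzzy_match(before, after):
    """Map each index in `after` to an index in `before`, or None if unmatched."""
    result = [None] * len(after)

    def solve(a_lo, a_hi, b_lo, b_hi):
        if a_lo >= a_hi or b_lo >= b_hi:
            return
        # Initial pass: best "before" candidate for every "after" line.
        candidates = []
        for a in range(a_lo, a_hi):
            b = max(range(b_lo, b_hi), key=lambda j: dice(after[a], before[j]))
            candidates.append((dice(after[a], before[b]), a, b))
        # Fix the most confident match and use it as the partition line.
        _, a, b = max(candidates, key=lambda c: c[0])
        result[a] = b
        # Recurse above and below; `b` stays in both halves (assumption).
        solve(a_lo, a, b_lo, b + 1)
        solve(a + 1, a_hi, b, b_hi)

    solve(0, len(after), 0, len(before))
    return result


before = ["int foo(int a, int b);", "void bar(void);", "char *baz(char *s);"]
after = ["int foo(int a,", "        int b);", "void bar(void);",
         "char *baz(char *str);"]
print(fuzzy_match(before, after))  # prints [0, 0, 1, 2]
```

Because each recursion only looks at "before" lines on the same side of the
partition line, the resulting mapping can never go backwards, which is what
prevents the jumbling seen after the initial pass.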

I look forward to hearing your thoughts.
Thanks,
-Michael

[1] https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
[2] https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Michael Platings (1):
  Add git blame --fuzzy option.

 blame.c                | 352 +++++++++++++++++++++++++++++++++++++++++++++++--
 blame.h                |   1 +
 builtin/blame.c        |   3 +
 t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++
 4 files changed, 609 insertions(+), 11 deletions(-)
 create mode 100755 t/t8020-blame-fuzzy.sh

-- 
2.14.3 (Apple Git-98)