A better approach to diffing and merging

"Ian Clarke" <ian.clarke@xxxxxxxxx> · Sat, 29 Nov 2008 12:12:05 -0600

Apologies if this is off-topic, but I recently had an idea for a
better way to do diffs and merging which I thought may be of interest
to those on this list.

I described it in a blog entry here: http://budurl.com/jyt6

For your convenience, the text is pasted below (although it is missing a few hyperlinks):

A plan for better source code diffs and merging
===========================================

I've been using Subversion for years, but a few months ago I was
really starting to feel the limitations of not being able to create
and merge branches easily. I'd heard that Git made this very easy
indeed, and so I decided to try it.

Anyway, this isn't yet another "I discovered Git and now I've achieved
self-actualization" blog post, so to cut a long story short, I now use
Git for everything (together with the excellent GitHub).

Even though merging is a lot better with Git than Subversion, I've
still found myself getting into situations where it requires a lot of
work to merge a branch back into another branch, and it got me
thinking about better ways to do merging.

While I'm no merging expert, it seems that most merging algorithms do
it on a line-by-line basis, treating source code as nothing but a list
of lines of text.  That raised a question: what if the merging
algorithm understood the structure of the source code it is trying to
merge?

So the idea is this:

Provide the merge algorithm with the grammar of the programming
language, perhaps in the form of a Bison grammar file, or some other
standardized way to represent a grammar.

The merge algorithm then uses this to parse the files to be diffed
and/or merged into trees, and then the diff and merge are treated as
operations on these trees.  These operations may include creating,
deleting, or moving nodes or branches, renaming nodes, etc.  There has
been quite a bit of academic research on this topic, although I
haven't yet found off-the-shelf code that will do what we need.
Still, it shouldn't be terribly hard to implement.
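
As a rough illustration of treating a diff as operations on trees, here
is a minimal sketch. It is not the proposed tool: it assumes Python's
built-in `ast` parser as a stand-in for a Bison grammar, and it only
compares top-level definitions by name. The helper names
(`top_level_defs`, `tree_diff`) are made up for this example.

```python
import ast

def top_level_defs(source):
    """Map each top-level function/class name to a dump of its syntax tree."""
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)  # dump omits line/column info by default
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def tree_diff(old_source, new_source):
    """Return (operation, name) pairs describing top-level changes."""
    old, new = top_level_defs(old_source), top_level_defs(new_source)
    ops = []
    for name in sorted(old.keys() - new.keys()):
        ops.append(("delete", name))
    for name in sorted(new.keys() - old.keys()):
        ops.append(("create", name))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            ops.append(("modify", name))
    return ops

OLD = """
def area(w, h):
    return w * h

def unused():
    pass
"""

NEW = """
def perimeter(w, h):
    return 2 * (w + h)

def area(w, h):
    return h * w
"""

print(tree_diff(OLD, NEW))
```

Because definitions are matched by name rather than by position, moving
`area` further down the file is not reported as a change in itself; only
the edit to its body is. Detecting renames or moves between nested
scopes would need a real tree-matching algorithm, which this sketch
does not attempt.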

The beauty of this approach is that the merge algorithm should be far
less likely to be confused by formatting changes, and much more likely
to correctly identify what has changed.
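
To make the formatting claim concrete, the following sketch compares the
same function written two ways. It again assumes Python's `ast` module as
the parser (an assumption for illustration, not part of the proposal),
and the names `same_structure` and `changed_lines` are hypothetical.

```python
import ast
import difflib

COMPACT = "def total(xs):\n    return sum(x*2 for x in xs)\n"

# Same code, reflowed: extra spaces, a blank line, and wrapped arguments.
REFORMATTED = (
    "def total( xs ):\n"
    "\n"
    "    return sum(\n"
    "        x * 2\n"
    "        for x in xs\n"
    "    )\n"
)

def same_structure(a, b):
    """True if both sources parse to the same syntax tree."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

def changed_lines(a, b):
    """Lines a line-based diff reports as added or removed."""
    diff = difflib.ndiff(a.splitlines(), b.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]

print("same tree:", same_structure(COMPACT, REFORMATTED))
print("line diff entries:", len(changed_lines(COMPACT, REFORMATTED)))
```

The line-based comparison flags every line, while the tree comparison
sees no change at all. Note that this particular parser also discards
comments, which a real merge tool would need to preserve.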

I can't think of any reason that such a tool wouldn't work in the
exact same way as existing diff/merge tools from the programmer's
perspective. The tool would automatically select the correct grammar
based on the file extension, or fall back to line-based diffs if the
extension is unrecognized (or the file isn't valid for the selected
grammar). Thus, it should be trivial to use this new tool with
existing version control systems.
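
The selection and fallback logic described above could look roughly like
this. It is a sketch under stated assumptions: the registry
`STRUCTURAL_DIFFERS`, the function `diff_file`, and the placeholder
result `"structure changed"` are all invented for the example, and only
a `.py` entry backed by Python's `ast` module is registered.

```python
import ast
import difflib
from pathlib import Path

def python_structural_diff(old, new):
    """Toy structural diff: raises SyntaxError if either input is invalid."""
    if ast.dump(ast.parse(old)) == ast.dump(ast.parse(new)):
        return []
    return ["structure changed"]  # placeholder for real tree operations

def line_diff(old, new):
    """Ordinary line-based diff, used as the fallback."""
    return list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))

# Hypothetical registry mapping file extensions to grammar-aware differs.
STRUCTURAL_DIFFERS = {".py": python_structural_diff}

def diff_file(filename, old, new):
    """Return (mode, result), where mode is "tree" or "line"."""
    differ = STRUCTURAL_DIFFERS.get(Path(filename).suffix.lower())
    if differ is None:
        return "line", line_diff(old, new)  # unrecognized extension
    try:
        return "tree", differ(old, new)
    except (SyntaxError, ValueError):
        return "line", line_diff(old, new)  # file not valid for the grammar

print(diff_file("a.py", "x = 1\n", "x=1\n"))
print(diff_file("notes.txt", "a\n", "b\n")[0])
print(diff_file("broken.py", "def f(:\n", "def f():\n    pass\n")[0])
```

Since the caller always gets a diff back, a version control system could
invoke such a tool the same way it invokes a line-based one.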

I'd love to have the time to implement this, although regrettably it
is at the bottom of a very large pile of "some day" projects.  I think
this is an interesting enough idea, and one that would be immediately
useful, that if I put it out there someone somewhere might be able to
make it a reality.

Any takers? I've set up a Google Group for further discussion; please
join if interested.