Re: A better approach to diffing and merging

Ian Clarke wrote:

> Provide the merge algorithm with the grammar of the programming
> language, perhaps in the form of a Bison grammar file, or some other
> standardized way to represent a grammar.
> 
> The merge algorithm then uses this to parse the files to be diffed
> and/or merged into trees, and then the diff and merge are treated as
> operations on these trees.  These operations may include creating,
> deleting, or moving nodes or branches, renaming nodes, etc.  There has
> been quite a bit of academic research on this topic, although I
> haven't yet found off-the-shelf code that will do what we need.
> Still, it shouldn't be terribly hard to implement.

There's a huge flaw in that approach for C/C++: in order to parse C/C++
you have to first preprocess it -- consider the twisty mazes that
#ifdef/#else/#endif can create.  But in order to preprocess source code
you need a whole heap of extra information that is not in the repository
(or if it is, cannot be automatically extracted.)

For example, you'd have to know all the -D/-U/-I flags that the makefile
or the user might pass to the compiler.  You'd have to replicate the
compiler's complicated header search path algorithm, which can depend on
the directives in the code as well as command line arguments,
environment variables, and values specific to the toolchain.  (Don't
forget that you can have code in a repository that's meant to be
cross-compiled and which uses a toolchain that has its own headers and
not the ones in /usr/include.)  You'd have to know all the built-in
predefined symbols of that toolchain, e.g. what's the value of
__GNUC_MINOR__ or __GNUC_PATCHLEVEL__, is __mips__ or __i386__ defined,
and on and on.  And of course the natural conclusion of this
progression: a change can be perfectly grammatically correct for one
particular platform/toolchain/setting of CFLAGS, and completely broken
for another.  There's no way for a VCS to know any of this; it takes
human comprehension.

If you look at a tool like doxygen that attempts to parse C/C++, it
doesn't actually do full preprocessing, only a very limited subset: it
only expands macros that the user names as relevant in the config file,
and it only preprocesses included headers that match a pathspec the user
provides.  Consequently it cannot fully parse the code to see if it's
grammatically correct, only to the limited extent that it can infer the
location where things appear to be defined.  And it is easily confused,
e.g. it will "see" code in both halves of an #ifdef section if it wasn't
told anything about the value of the macro in the config file, which can
cause it to incorrectly think that a function or variable was defined
there when in reality that section was discarded.

The idea may have value for languages that are easy to parse and do not
have all this preprocessor cruft, but I just don't see how it would be
able to provide anything useful for non-trivial changes to real world
C/C++, which require human eyes to decipher.

Brian
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
