Default history simplification

Tommy Wang <subscription@xxxxxxxxxxx> · Thu, 19 Nov 2009 17:30:43 -0600

Hi all,

I'm working on a small perl script that needs to reproduce the results
of the default git history simplification used by log & rev-list when
a list of paths is given.  It is important that I reproduce its
results exactly.  I would love to simply use the rev-list built-in to
do my work for me; but I fear that I may have much too many path
limiters than the linux command-line can handle (which if I'm correct,
can only take so many arguments).

I can understand whether a commit is included or not rather easily: we
basically only include a commit iff the commit introduces a change to
the given path (ie, it is !TREESAME to any of its parents).

For merges, I get a little confused.  In the simple case of a 2-parent
merge, if we find that we are TREESAME with one parent -- it follows
that the interesting change must have come from that parent.  So we
follow it, and can safely assume that the other parent is
uninteresting.

If we find that we are TREESAME with both parents, it follows that
neither parent nor the merge commit is interesting.  Therefore, we
randomly pick the first parent and move up the graph looking for an
interesting commit.  Looking at this more closely, the parents we
ignored will fall under three scenarios:

1. It finds it way all the way up to a root commit without finding an
interesting commit -- which can happen if your repository has multiple
root commits.
2. It will move up the graph and eventually find a common ancestor
with the first parent.
3. It will find an interesting commit that has an identical/equivalent
commit somewhere in the first parent's ancestry.

In all 3 cases, it is safe to follow just the first TREESAME parent.
It is interesting to note that if are working with a repository with
multiple root commits (say, you imported a project without history,
and later merged in its upstream history), you could potentially lose
that history with this algorithm if you happened to merge it as the
second parent.  This is not necessarily a flaw (in fact, it may be a
feature!), since you have still fully explained the state of your HEAD
with respect to the first parent (which is probably your mainline).

In the case that neither parent was TREESAME, we can have 2 scenarios:

1. The merge commit itself made the interesting change.
2. The interesting changes came from both parents, in which we would
rightfully follow both parents.

My first question is this:

If the merge commit itself (D) made the interesting change, we could
potentially follow an uninteresting parent -- most likely all the way
up until we find a common ancestor (A).  This would produce a graph
that looks like:

A---B---D
 \-------/ (C pruned)

Or, if both parents were uninteresting:

A---D (both B & C pruned)
 \---/

I assume that the simplification takes care of this by removing
duplicate parents as well as searching for common ancestors?  (It is
not mentioned in the documentation).

My second question is then:

Given that we perform an extra simplification pass to remove common
ancestors and duplicate parents, what is the purpose of selectively
following parents?  Is it just for speed?  If we always followed all
parents and applied our duplicate simplification and common ancestor
simplification, we would not always arrive at the same result?  In
which case, if my application is not concerned with speed (ie, I don't
mind following all parents), would I always arrive at the same graph?

Thanks,

Tommy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html