Re: commit-graph: change in "best" merge-base when ambiguous

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 22 May 2018 08:48:04 -0400

On 5/22/2018 1:39 AM, Michael Haggerty wrote:
On 05/21/2018 08:10 PM, Derrick Stolee wrote:
[...]
In the Discussion section of the `git merge-base` docs [1], we have the
following:

     When the history involves criss-cross merges, there can be more than
one best common ancestor for two commits. For example, with this topology:

     ---1---o---A
         \ /
          X
         / \
     ---2---o---o---B

     both 1 and 2 are merge-bases of A and B. Neither one is better than
the other (both are best merge bases). When the --all option is not
given,     it is unspecified which best one is output.

This means our official documentation mentions that we do not have a
concrete way to differentiate between these choices. This makes me think
that this change in behavior is not a bug, but it _is_ a change in
behavior. It's worth mentioning, but I don't think there is any value in
making sure `git merge-base` returns the same output.

Does anyone disagree? Is this something we should solidify so we always
have a "definitive" merge-base?
[...]
This may be beyond the scope of what you are working on, but there are
significant advantages to selecting a "best" merge base from among the
candidates. Long ago [1] I proposed that the "best" merge base is the
merge base candidate that minimizes the number of non-merge commits that
are in

     git rev-list $candidate..$branch

that are already in master:

     git rev-list $master

(assuming merging branch into master), which is equivalent to choosing
the merge base that minimizes

     git rev-list --count $candidate..$branch

In fact, this criterion is symmetric if you exchange branch ↔ master,
which is a nice property, and indeed generalizes pretty simply to
computing the merge base of more than two commits.

In that email I also included some data showing that the "best" merge
base almost always results in either the same or a shorter diff than the
more or less arbitrary algorithm that we currently use. Sometimes the
difference in diff length is dramatic.

To me it feels like the best *deterministic* merge base would be based
on the above criterion, maybe with first-parent reachability, commit
times, and SHA-1s used (in that order) to break ties.

Thanks, everyone, for your perspective on this. I'm walking away with 
these conclusions:

1. While this is a change in behavior, it is not a regression. We do not 
need to act immediately to preserve old behavior in these ambiguous cases.

2. We should (eventually) define tie-breaking conditions. I like 
Michael's suggestion above.

Thanks,
-Stolee