Re: commit-graph: change in "best" merge-base when ambiguous

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 25 May 2018 00:08:28 +0200

Derrick Stolee <stolee@xxxxxxxxx> writes:
> On 5/22/2018 1:39 AM, Michael Haggerty wrote:
>> On 05/21/2018 08:10 PM, Derrick Stolee wrote:
>>> [...]
>>> In the Discussion section of the `git merge-base` docs [1], we have the
>>> following:
>>>
>>>      When the history involves criss-cross merges, there can be more than
>>> one best common ancestor for two commits. For example, with this topology:
>>>
>>>      ---1---o---A
>>>          \ /
>>>           X
>>>          / \
>>>      ---2---o---o---B
>>>
>>>      both 1 and 2 are merge-bases of A and B. Neither one is better than
>>> the other (both are best merge bases). When the --all option is not
>>> given, it is unspecified which best one is output.
>>>
>>> This means our official documentation mentions that we do not have a
>>> concrete way to differentiate between these choices. This makes me think
>>> that this change in behavior is not a bug, but it _is_ a change in
>>> behavior. It's worth mentioning, but I don't think there is any value in
>>> making sure `git merge-base` returns the same output.
>>>
>>> Does anyone disagree? Is this something we should solidify so we always
>>> have a "definitive" merge-base?
>>> [...]
>> This may be beyond the scope of what you are working on, but there are
>> significant advantages to selecting a "best" merge base from among the
>> candidates. Long ago [1] I proposed that the "best" merge base is the
>> merge base candidate that minimizes the number of non-merge commits that
>> are in
>>
>>      git rev-list $candidate..$branch
>>
>> that are already in master:
>>
>>      git rev-list $master
>>
>> (assuming merging branch into master), which is equivalent to choosing
>> the merge base that minimizes
>>
>>      git rev-list --count $candidate..$branch

Is the above correct...

>> In fact, this criterion is symmetric if you exchange branch ↔ master,
>> which is a nice property, and indeed generalizes pretty simply to
>> computing the merge base of more than two commits.

...as it doesn't seem to have the described symmetry.

>>
>> In that email I also included some data showing that the "best" merge
>> base almost always results in either the same or a shorter diff than the
>> more or less arbitrary algorithm that we currently use. Sometimes the
>> difference in diff length is dramatic.
>>
>> To me it feels like the best *deterministic* merge base would be based
>> on the above criterion, maybe with first-parent reachability, commit
>> times, and SHA-1s used (in that order) to break ties.
>
> Thanks, everyone, for your perspective on this. I'm walking away with
> these conclusions:
>
> 1. While this is a change in behavior, it is not a regression. We do
> not need to act immediately to preserve old behavior in these
> ambiguous cases.
>
> 2. We should (eventually) define tie-breaking conditions. I like
> Michael's suggestion above.

One thing I'd like to point out is that when searching for some
algorithm to speed up merge-base calculation (which is called lowest
common ancestor in graph theory, and for which I have currently found
only an algorithm with O(|V|*{E|) preparation time, and U(1) query)
I have found instead attempts to rigorously define single representative
lowest common ancestor.  It might be worth a look how it is done.

Another possible source to compare against is the algorithm used by
Mercurial (which as far as I know doesn't use recursive merge strategy,
so it needs to chose one merge base).

HTH,
-- 
Jakub Narębski