Re: Our merge bases sometimes suck

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 06/13/2014 11:38 AM, Michael J Gruber wrote:
> Michael Haggerty venit, vidit, dixit 13.06.2014 00:12:
>> I've been thinking a lot about merge bases lately and think I have
>> discovered something interesting.
>>
>> tl;dr:
>>
>> When two branches have multiple merge bases,
>>
>>     git merge-base $master $branch
>>
>> picks one merge base more or less arbitrarily.  Here I describe a
>> criterion for picking a "best" merge base.  When the best merge base
>> is used, the diff output by
>>
>>     git diff $master...$branch
>>
>> becomes shorter and simpler--sometimes dramatically so.  I have
>> quantified the improvement by analyzing historical merges from the Git
>> project (see attached image).  Best merge bases might also help reduce
>> conflicts when conducting merges.
>>
> 
> Everytime I looked at our merge base code, my head started spinning. So
> it's admirable you even started this endeavor :)

I haven't looked at the code :-)  My email came only out of reading docs
and experimenting.

> We use merge bases for several things:
> 
> - merging
> - resolving "A...B" rev notation (rev-list and friends), aka symmetric
> difference
> - left/right selection/annotation of commits (rev-list/log)
> - resolving "diff A...B", which may be a handy notation, but is horribly
> misleading because it is a very unsymmetric form of diff
> 
> Out of these four, you seemingly picked the one as pivoting example
> which is most different from any symmetric notion of "...". But in fact,
> "merging" is quite unsymmetric, too, because merging branch into master
> is quite different from the other way round.
> 
> This is certainly obvious to you, bit I thought I point it out for the
> convenience of other readers: You are after optimizing the merge base
> choice for an unsymmetric use case, which covers "diff A...B" as well as
> "merge B into A".
> 
> In these two cases, choosing different merge bases can lead to different
> results. The other two basically use all merge bases anyways.

Correct, the scope of my proposal is for cases when there are multiple
merge bases but one has to be selected.

> [...]
>> I propose that the best merge base is the merge base "candidate" that
>> minimizes the number of non-merge commits that are in
>>
>>     git rev-list --no-merges $candidate..$branch
>>
>> but are already in master:
>>
>>     git rev-list --no-merges $master
>>
>> Since a non-merge commit should embody a single logical change,
>> counting non-merge commits is in some sense counting changes [2].
>>
>> We can put this criterion in simpler form.  Because every candidate is
>> a merge-base,
>>
>>     git rev-list --no-merges $candidate..$branch
>>
>> necessarily includes *all* of the non-merge commits that are on branch
>> but not on master. 
> 
> This is actually true for every merge base candidate (i.e. every commit
> which is both on master and branch).
> 
> [These things are really much easier in set language, A..B being B
> \setminus A, being B intersected with the complement of A etc.]
> 
>> This is a fixed number of commits, the same for
>> every candidate.
> 
> "This" refers to "branch minus master" (commits on branch but not on
> master), which is defined indepently of candidate.
> 
>>  It *additionally* includes the commits that are on
>> master but not yet in branch.  This second number varies from one
>> candidate to another.
> 
> "master minus branch" is independent of candidate also.
> 
> The second number is rather those commits on candidate..branch which are
> not "branch minus master", i.e. "branch minus candidate" minus "branch
> minus master", which is "branch intersected with master minus
> candidate", i.e. those commits on candidate..branch which are also on
> master. [Again this is true for every merge base candidate, not just
> every merge base.]
> 
>>  So if we minimize the number of commits in this
>> output, is is the same as minimizing the number of unwanted commits.
> 
> Now I understand why this statement is true :)

Thanks for the translation to set notation.  It is indeed easier to work
with.

>> Therefore, to get the best merge base, all we have to do is pick the
>> merge base that minimizes
>>
>>     git rev-list --count --no-merges $candidate..$branch
>>
>> There can be ties, but in practice they are rare enough that it is
>> probably not worth worrying about them.
>>
>>
>> Symmetry; generalization to more than two branches
>> ==================================================
>>
>> Interestingly, minimizing the last criterion is the same as maximizing
>>
>>     git rev-list --count --no-merges $candidate
>>
>> because there is a fixed number of commits in
>>
>>     git rev-list --no-merges $branch
>>
>> , and each of those commits is in exactly one of the two counts above.
> 
> That's a cute observation, which in the example above is wrong on first
> glance, but true on second: Visually/subjectively, one easily misses
> commits B and C when counting "X..master", which is not just the commits
> "extending master beyond X", etc.
> 
>> This formulation shows that the best merge base for computing
>>
>>     git diff $master...$branch
>>
>> is also the best merge base for computing
>>
>>     git diff $branch...$master
>>
>> ; i.e., the best merge base is symmetric in its arguments.  It also
>> shows that the concept of "best merge base" can trivially be
>> generalized to more than two branches.
> 
> That symmetry is really surprising, but the argument is convincingly
> correct. Alas, that count back to the roots can be really expensive.

I mentioned the formulation

    git rev-list --count --no-merges $candidate

not because it is necessarily the most efficient way to find the best
merge base, but because it is theoretically useful.  For the
implementation, any of the formulations can be used, because they are
all equivalent.

> Given that the arguments in the previous argument show that everything
> applies to merge base candidates as well, not just merge bases, I'm
> wondering whether we can weave in the weighing (which one is best) with
> our existing merge base algorithm or maybe an alternative algorithm
> using generation numbers or such.
> [...]

That's an interesting idea.  I haven't looked at that code at all, so if
you want to get involved it would be awesome!

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]