Re: [RFH] git cherry vs. git rev-list --cherry, or: Why does "..." suck?

Michael J Gruber <git@xxxxxxxxxxxxxxxxxxxx> · Thu, 24 Mar 2011 08:38:48 +0100

Junio C Hamano venit, vidit, dixit 23.03.2011 19:20:
> Michael J Gruber <git@xxxxxxxxxxxxxxxxxxxx> writes:
> 
>> Adding some recent insight:
>>
>> Michael J Gruber venit, vidit, dixit 22.03.2011 13:07:
>>> Performance
>>> ===========
>>>
>>> I don't get this:
>>>
>>> git cherry A B: 0.4s
>>> git rev-list --cherry A...B: 1.7s
>>> (more details below)
>>
>> I can get the latter down to 0.95s and this
>>
>>> merge-base A B: 0.95s
>>> merge-base --all A B: 0.95s
>>> rev-parse A...B: 0.95s
>>
>> to 0.16s each. The downside is that merge-base may give a few
>> unneccessary candidates (commits which are ancestors of other commits it
>> returns), but this does not change the results for rev-list, of course.
>>
>> I get this dramatic speedup by removing the check for duplicates from
>> get_merge_bases_many() in commit.c. After a first merge_bases_many()
>> run, returning N commits, that check calls merge_bases_many() again for
>> each pair (N choose 2) to check whether one is contained in the other.
>> Quite a bottleneck. Removing it works great. But can we live with a few
>> additional merge bases?
> 
> When we run merge-base as the top-level command (this includes
> reduce_heads() that is used by "git merge"), we have to cull unnecessary
> phantom bases that can be reached by other bases, so you are not allowed
> to make such a change unconditionally.
> 
> Passing down a parameter from a caller that is prepared to handle phantom
> merge bases correctly is probably the right approach.  Existing callers
> can make "safer" calls for now; you can later examine them and turn them
> into "faster" calls if they operate correctly given a result that contain
> phantom bases.

Yes, I was thinking of having thorough vs. fast mode, but I'll dig more
into merge_bases_many().

My current impression is that those phantom merge bases appear only
(mainly?) when there are severe date ordering problems in the dag.
merge_bases_many() uses commit_list_insert_by_date(), and given the way
it walks, later merge bases which are ancestors of another one get
marked STALE automatically.

For callers which are interested in one base only, the check makes a
difference only if there are date ordering problems.

In my A B example, merge_bases_many() comes up with the correct 25 ones
during its first run, and then gets called 300 times again (25*24/2) to
check each pair of them, without any reduction. (Clearly, it's a non
issue in the case of unique merge base.)

What I'm mainly interested in is the A...B case. And for a revision walk
with one A...B range and revs->limited, one could even embed the whole
mb-logic into the walk! But I don't see how to do this for multiple
symmetric ranges.

Michael
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html