Re: in_merge_bases() is too expensive for recent "pu" update

Jeff King <peff@xxxxxxxx> · Fri, 24 Aug 2012 11:15:11 -0400

On Fri, Aug 24, 2012 at 11:43:40AM +0200, Thomas Rast wrote:

> > Start from A and B.  Follow from B to find 'x' and paint it in blue,
> > follow from A to find 'y' and paint it in amber.  Follow from 'x' to
> > '1', paint it in blue.  Follow from 'y' to '1', paint it in amber
> > but notice that it already is painted in blue.
> [...]
> >             o-------o
> >            /         \
> >           /       y---A
> >          /       /
> >      ---2---z---1---x---B
> >          \         /
> >           o-------o
> [...]
> > So we need to notice that '1' and '2' have ancestry relation in
> > order to reject '2' as "common but not merge-base".  One way of
> > doing so is not to stop at '1' and keep digging (and eventually we
> > find that '2' is what we could reach from '1' that already is a
> > merge base), but then we will be susceptible to the same kind of
> > clock skew issue as the revision traverser.
> 
> I think that is *the* way to do it.
> 
> And the way to fix the clock skew issue (also in the revision traversal)
> is Peff's generation number cache.  Just keep propagating marks, digging
> in generation order, until the generations prove that nothing new can
> happen.

I thought you were just interested in speeding up is_descendent_of. You
should be able to do that without a generation number. Just start from A
and B as above, do the two-color painting, and do not add the parents of
any two-color commits (because you know they are ancestors of both, and
therefore you cannot find either by looking backwards). If you paint "B"
amber, then it is a descendent of "A" (and vice versa if you paint "A"
blue).

Clock skew may make you take longer (because you may go depth-first down
an uninteresting chain of commits), but it should never give you the
wrong answer. It's not as fast as using a generation-based cutoff
(because you have to keep walking to the actual shared ancestor), but in
practice it's usually fine.

The reason I did not do that for "tag --contains" is that I wanted to
do a simultaneous walk for all tags. That would require N color bits for
N tags, and we have a limited space in each commit object. I didn't time
using a separate hash table to store those bits outside of the commit
objects. That would have a higher constant, but should still yield good
big-O complexity.

>   [Side note, in reply to an earlier comment in the rev-list thread: I
>   agree with Peff's reasoning that a cache is better than a commit
>   header.]

I still think it's a good idea to keep it out of the commit header. But
I think I might lean towards tying it to the pack index (i.e.,
generating it when we generate the index, and depending on the sha1
table in the index). That would be smaller, and gets rid of any
complexity or performance issues with making incremental updates to the
cache. Loose objects wouldn't have a cached version number, but that's
OK; in practice you can quickly walk backwards to a commit that _is_
cached.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html