Re: [RFC/PATCH 1/4] tag: speed up --contains calculation

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 13 Oct 2010 17:07:53 -0500

Hi,

Jeff King wrote:

> When we want to know if commit A contains commit B (or any
> one of a set of commits, B through Z),

Let's revive this old thread.  Sorry for the long silence (the last
time this topic was visited was a couple months ago [1] afaict).

> Instead, let's leverage the fact that we are going to use
> the same --contains list for each tag, and mark areas of the
> commit graph is definitely containing those commits, or
> definitely not containing those commits. Later tags can then
> stop traversing as soon as they see a previously calculated
> answer.

Makes sense.  And the resulting timings are exciting.

> ---
> Note that this is a depth first search, whereas we generally traverse in
> a more breadth-first way. So it can actually make things slightly slower
> than the current merge-base code, if:
> 
>   1. You don't have any merge base calculations that involve going very
>      far back in history.
> 
>   2. We depth-first down the wrong side of a merge.
> 
> However, in the usual cases, I think it will perform much better.

Nit: Wouldn't this (or a summary of it) belong in the log message, too,
to avoid headscratching when a person tries to bisect a performance
regression down the line?

>   1. This can probably be a one-liner change to use in "git branch
>      --contains", as well. I didn't measure it, as that already tends to
>      be reasonably fast.

These days "branch -r --contains" has been _sloow_ for me.  So once
this is tuned, that wouldn't be a bad idea, methinks.

>   2. It uses commit marks, which means it doesn't behave well with other
>      traversals. I have no idea if I should be using alternate marks or
>      not.

This is what Junio was referring to with "the use of TMP_MARK smelled
somewhat bad to me", right?

>   3. We never clear the marks, or check them against the "want" list.

This too, I suppose.

> So
>      we just assume that repeated calls to contains will have the same
>      "want" list.

As long as this is advertised, I think it is fine.  If someone needs
to use --contains for multiple purposes in a builtin, that can be
dealt with then (maybe with a separate function to reset the marks).

Maybe it would make sense to keep this static in builtin/tag.c for
now, and defer cleaning up the interface to a separate patch.

> --- a/commit.c
> +++ b/commit.c

Now the meat of the patch.  git has a few algorithms for checking if
one commit is an ancestor of another.

is_descendant_of::

	1. find merge bases.  This is breadth-first search in
	   date order, but it walks through all ancestors as far
	   as I can tell.
	2. make them independent (seems pointless...).
	3. check if ancestor is one of them.

get_revision after setting UNINTERESTING (i.e., rev-list -1 ^tag cmit)::

	1. find interesting and uninteresting commits.  This is
	   breadth-first search in date order.
	2. when the first interesting commit is found, emit it
	   and stop.

Neither is optimized to check a batch of commits to find which contain
the commit of interest.

join_revs of show-branch::

	1. color the tag and cmit.
	2. let colors propagate back to all their ancestors, stopping
	   propagation along a given path when a commit has all colors.
	   This is breadth-first traversal in date order.
	3. for each commit with all colors, propagate that knowledge
	   to all parents we've encountered before.
	4. check what colors cmit has.

limit_to_ancestry::

	1. compute rev-list ^cmit tag in the usual way.
	2. reverse the revision list.
	3. color cmit with TMP_MARK and let that color permeate
	   up the backwards revision list.
	4. repeat! until no progress is made.
	4. check what color tag has.

None of these seem to tolerate timestamp skew. :(

Am I understanding correctly?

If so, just making is_descendant_of less stupid (e.g., not traversing
all history) would presumably help a lot.  Another approach would be
to try the show-branch tactic repeatedly with small enough batches of
tags to fit in the flag bitfield.  Yet another approach would be to
take the limit_to_ancestry one, but if the revision list is not
topologically sorted, it could be slow.

Hopefully your approach will be even better...

> @@ -845,3 +845,45 @@ int commit_tree(const char *msg, unsigned char *tree,
[...]
> +static int contains_recurse(struct commit *candidate,
> +			    const struct commit_list *want)
> +{
> +	struct commit_list *p;
> +
> +	/* was it previously marked as containing a want commit? */
> +	if (candidate->object.flags & TMP_MARK)
> +		return 1;
> +	/* or marked as not possibly containing a want commit? */
> +	if (candidate->object.flags & UNINTERESTING)
> +		return 0;
> +	/* or are we it? */
> +	if (in_commit_list(want, candidate))
> +		return 1;
> +
> +	if (parse_commit(candidate) < 0)
> +		return 0;
> +
> +	/* Otherwise recurse and mark ourselves for future traversals. */
> +	for (p = candidate->parents; p; p = p->next) {
> +		if (contains_recurse(p->item, want)) {
> +			candidate->object.flags |= TMP_MARK;
> +			return 1;
> +		}
> +	}
> +	candidate->object.flags |= UNINTERESTING;

Nice.  We cache the "is wanted" result using TMP_MARK and
UNINTERESTING.  It cannot be slower than is_descendant_of if I
understood correctly because the latter is a little crazy.

Possible unprincipled approach:

 1. Mark the wanted commits with TMP_MARK.
 2. Perform a breadth-first, date-order search starting at
    candidate to find a commit with TMP_MARK.
    a. If not found, mark the commits involved as UNINTERESTING
    b. If found, let the TMP_MARK permeate up using the
       limit_to_ancestry algorithm.
 3. Repeat with each other candidate.
 4. (If needed) walk starting at each candidate to clear the
    temporary flags.

I am not sure if that would be any faster, though.  What do you think?

[...]
> +int contains(struct commit *, const struct commit_list *);

Thanks for a pleasant read.

[1] http://thread.gmane.org/gmane.comp.version-control.git/152607/focus=152701
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html