Re: [RFC/PATCHv2 6/6] limit "contains" traversals based on commit generation

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 15 Jul 2011 11:22:03 -0700

Jeff King <peff@xxxxxxxx> writes:

> diff --git a/builtin/tag.c b/builtin/tag.c
> index 63bce6e..df6de47 100644
> --- a/builtin/tag.c
> +++ b/builtin/tag.c
> @@ -40,7 +40,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
>  }
>  
>  static int contains_recurse(struct commit *candidate,
> -			    const struct commit_list *want)
> +			    const struct commit_list *want,
> +			    unsigned long cutoff)
>  {
>  	struct commit_list *p;
>  
> @@ -57,9 +58,13 @@ static int contains_recurse(struct commit *candidate,
>  	if (parse_commit(candidate) < 0)
>  		return 0;
>  
> +	/* stop searching if we go too far back in time */
> +	if (commit_generation(candidate) < cutoff)
> +		return 0;
> +

Here, the "generation number" was the commit timestamp of the commit in
your earlier round, but now it is not.

I agree with Linus that for the purpose of "rev-list A ^B ^C" computation,
"generation number" is a much better thing to use than the commit
timestamp, and I also think if we want to revamp the codepath around
still_interesting() in revision.c, Linus's "let's add generation header in
new commits" is a good first step. I haven't stared at that codepath long
enough lately to say for sure, but I suspect that in that codepath we
could use generation number in commits when available and fall back to
timestamp with slop without recomputing the generation number and caching
for older commits.

But that is _not_ the codepath your series is about.

What you are trying to say in this series is that "If a tag points at a
commit X (i.e. candidate), another commit Y (i.e. "want") that is much
younger than cannot possibly be included by it, because a tag was made on
commit X way before Y was created". You cut off by "age".

The heuristics would work efficiently to check what tags point at a
relatively recent commit (e.g. when trying to see which releases are
affected and needing a fix by a recently discovered bug introduced by
commit Y) by discarding tags for ancient releases. In such a scenario, the
timestamp of a tagged and ancient commit X on a side branch that was never
merged in the mainline that leads to commit Y (i.e. "want"), in an ideal
world without clock skews, will be way older than the timestamp of commit
Y. In other words, if you use timestamp as "age", even though X and Y do
not relate to each other directly, except that they may share an ancient
common ancestor, their "age"s can be compared and you could apply your
heuristics to optimize.

But if you use the generation number as "age", even in an ideal world
without clock skew nor miscomputed generation number, you no longer can
compare "age" of X and Y.  The ancient side branch that led to X may have
tons more commits than the history leading to Y.

So I have two tangents here.

 * As to revision traversal, the reason we do the SLOP in
   still_interesting() is to avoid the issue arising from the following
   topology:

              5
             /
     ---2---4---...---1
         \
          3---6

   where numbers denote the timestamp of the commit (1 incorrectly records
   ancient timestamp, while everybody else has newer timestamp than its
   parents). When we try to "rev-list 5 --not 1 6", we start from having
   ~1, ~6 and 5 to "currently active" list, and iteratively pick more
   recent commits from the active list to dig deeper while newly found
   ancestors back to the active list. ~6 leads to 3 marked as
   uninteresting and active list gets ~3 while ~6 is removed. 5 leads to 4
   marked as interesting and 4 gets inserted in the list while 5 is
   removed. 4 finds 2 to also be interesting. Then ~3 finds 2 to be
   uninteresting. At that point, we have ~1 (still not processed) and ~2
   in the active list, while we have 5 and 4 in the result list. We need
   to dig through ~1 to realize that 4 is reachable from it to mark it
   uninteresting.

   So how about marking commits (using the metainfo-cache facility) that
   has an ancestor (not necessarily its direct parent) that records a
   younger timestamp (e.g. 1 is such a commit, as its ancestors include
   things like 2 and 4)? There should be relatively small number of them,
   and still_interesting() logic can be told to dig through such commits
   even if everybody is uninteresting in the active list.

 * As to "tag --contains", when timestamp based heuristics breaks down is
   when a tagged commit incorrectly records way young timestamp or the
   "want" commit records way old timetsamp. I haven't thought things
   through, but the same metainfo-cache may be useful to detect which
   commit to dig through ignoring the cutoff heuristics.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html