Re: jk/tag-contains: stalled

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 05 Aug 2010 22:44:23 -0700

Jeff King <peff@xxxxxxxx> writes:

> On Thu, Aug 05, 2010 at 11:47:09AM -0700, Junio C Hamano wrote:
>
>> My gut feeling is that it is probably Ok if contains() and its
>> recursive helper are moved to builtin/tag.c and are made static, to
>> make it clear that this should not be reused outside the current
>> context as a generic "contains" function.  It would probably help to
>> have a comment at the end of list_tags() to say that TMP_MARK _ought_
>> to be cleaned before leaving the function but we don't do that because
>> we know it is the last function in the callchain before we exit.

After thinking about this a bit more, I changed my mind.

I think depth-first traversal from all tag tips, without running any
traversal from the wanted commit, has a serious downside.  What happens
when you run "git tag --contains master" in a Linus tree after a major
release but before he tags the -rc1 and closes the merge window?  Doesn't
the algorithm run all the way down to the root, only to say nothing?
Admittedly the traversal will visit each commit once (starting from
v2.6.12 down to v2.6.12-rc2, then skipping v2.6.12-rc2 through v2.6.12-rc6
because they are already seen and known not to contain the "master", then
starting from v2.6.13 down to v2.6.12, and so on), but visiting all the
commits down to root feels really wrong.

I think the merge-base traversal is essential.  Suppose we have hundreds
of commits with two tags 'T1' and 'T2':

                                  o---T2        o---o---Y
                                 /             /
 ---o---o---o---o---o---o---o---%---...---X---*---T1

If you are probing for 'X', traversing from all the tags until you hit a
wanted commit will stop at a reasonable point (namely, 'X') if you are
lucky and you started from 'T1' not 'T2', with your new algorithm.  If you
are unlucky and started from 'T2', you will however traverse down to root
from 'T2' which would be the bulk of the history.

Also, when you are probing for 'Y', you end up traversing all the way
down, whether you start from 'T1' or 'T2', don't you?

You can (and should) dig both from 'Y' and 'T1' and stop traversal at '*',
as that commit is an ancestor of 'Y' and at that point you have proven
that any ancestor of that you will find by traversing further will not
find 'Y'.  Similarly for 'Y' and 'T2' pair, which should stop at '%'.

As the author of show-branch, I think I know one trick that may speed this
up without compromising the worst case scenario.

We can run this traversal in parallel for many tags at once.  Use bit #0
of object.flags for the wanted commit, and bit #1..bit #N for N tags (if
you have more tags than that fit in the flags word, you run the algorithm
multiple times, N tags at a time, clearing the flags after each round).

The parallel traversal will run, smudging the parent commit's flags with
those of the child, in the usual way.

Suppose that you are probing for 'Y'.  You start traversal from 'Y', 'T1'
and 'T2' (as two tags will fit comfortably in a single flags word).  When
you come to commit '%', you will notice that all the ancestors of that
commit are contained in the wanted commit and all the tags you are probing
for ('T1' and 'T2').  There is no point to go further than that point
during this round, as you know you will not find 'Y' by digging further
from that point.

In general, when you find that a commit has both bit #0 and bit #i set,
that commit is known to be a common ancestor between 'Y' and 'Ti', so
there is no point going further from that point to see if 'Ti' contains
'Y'---you know it doesn't.  So you can stop when you see bit 0..N are all
set.  After traversing as many commits as possible this way, look at the
bits 1..i that is given to 'Y'.  They represent the set of tags that
contain 'Y' (you will find only bit #0 is set on 'Y').

Similarly, if you are probing for 'X', start the traversal from 'X', 'T1'
and 'T2'.  The traversal that begins at 'T1' will reach 'X' (so 'X' will
initially start with bit #0 set, but then when the traversal that started
from 'T1' reaches it, its bit #1 gets set) eventually.  Because bit #2 is
not set at that point, you don't stop and keep traversing.

You may already have started marking 'X' and its parent with bit #0 with
the traversal you started at 'X', but when you reach 'X' from 'T1' again,
you will be changing bit #1; anytime you change the flag, you would push
the commit back to the stack and re-traverse it.  Eventually, you will
smudge '%' with all 3 bits (depending on clock skew, you might traverse a
few more levels of its parents) and you can stop.  The flag word on 'X'
will have bis #0 and #1 set, so you know 'T1' reaches it but 'T2' doesn't.

To optimize this further, you may want to take advantage of the fact that
tags are not supposed to change, and have a cache that knows what tag is
contained in what other tag (e.g. if v1.0 is contained in v2.0, you do not
have to run the probe for v2.0 once you know the wanted commit is
reachable from v1.0).  But that can come later, I think.

But I haven't thought things really through.  If I am lucky, I may be able
to find time tomorrow during the day to try conjuring something out, but
no promises...
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html