Re: [PATCH 7/6] ref-filter: use generation number for --contains

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/4/2018 3:16 PM, Jeff King wrote:
On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:

@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
   					      struct contains_cache *cache)
   {
   	struct contains_stack contains_stack = { 0, 0, NULL };
-	enum contains_result result = contains_test(candidate, want, cache);
+	enum contains_result result;
+	uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+	const struct commit_list *p;
+
+	for (p = want; p; p = p->next) {
+		struct commit *c = p->item;
+		parse_commit_or_die(c);
+		if (c->generation < cutoff)
+			cutoff = c->generation;
+	}
Now that you mention it, let me split out the portion you are probably
talking about as incorrect:

+	if (cutoff == GENERATION_NUMBER_UNDEF)
+		cutoff = GENERATION_NUMBER_NONE;
You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
will be limited to commits NOT in the commit-graph (which we hope is small
if proper hygiene is followed).
I think it's more than that. If we leave it at UNDEF, that's wrong,
because contains_test() compares:

   candidate->generation < cutoff

which would _always_ be true. In other words, we're saying that our
"want" has an insanely high generation number, and traversing can never
find it. Which is clearly wrong.

That condition is not always true (which is why we use strict comparison instead of <=). If a commit is not in the commit-graph file, then its generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c:

void *alloc_commit_node(void)
{
        struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
        c->object.type = OBJ_COMMIT;
        c->index = alloc_commit_index();
        c->graph_pos = COMMIT_NOT_FROM_GRAPH;
        c->generation = GENERATION_NUMBER_UNDEF;
        return c;
}


So we have to put it at "0", to say "you should always traverse, we
can't tell you that this is a dead end". So that part of the logic is
currently correct.

But what I was getting at is that the loop behavior can't just pick the
min cutoff. The min is effectively "0" if there's even a single ref for
which we don't have a generation number, because we cannot ever stop
traversing (we might get to that commit if we kept going).

(It's also possible I'm confused about how UNDEF and NONE are used; I'm
assuming commits for which we don't have a generation number available
would get UNDEF in their commit->generation field).

I think it is this case.

If you could make the assumption that when we have a generation for
commit X, then we have a generation for all of its ancestors, things get
easier. Because then if you hit commit X with a generation number and
want to compare it to a cutoff, you know that either:

   1. The cutoff is defined, in which case you can stop traversing if
      we've gone past the cutoff.

   2. The cutoff is undefined, in which case we cannot possibly reach
      our "want" by traversing. Even if it has a smaller generation
      number than us, it's on an unrelated line of development.

I don't know that the reachability property is explicitly promised by
your work, but it seems like it would be a natural fallout (after all,
you have to know the generation of each ancestor in order to compute the
later ones, so you're really just promising that you've actually stored
all the ones you've computed).

The commit-graph is closed under reachability, so if a commit has a generation number then it is in the graph and so are all its ancestors.

The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores "0" for generation number until this patch. It still satisfies the condition that gen(A) < gen(B) if B can reach A, but also gives us a condition for "this commit still needs its generation number computed".


I wonder to what degree it's worth traversing to come up with a
generation number for the "want" commits. If we walked, say, 50 commits
to do it, you'd probably save a lot of work (since the alternative is
walking thousands of commits until you realize that some ancient "v1.0"
tag is not useful).

I'd actually go so far as to say that any amount of traversal is
generally going to be worth it to come up with the correct generation
cutoff here. You can come up with pathological cases where you only have
one really recent tag or something, but in practice every repository
where performance is a concern is going to end up with refs much further
back than it would take to reach the cutoff condition.
Perhaps there is some value in walking to find the correct cutoff value, but
it is difficult to determine how far we are from commits with correct
generation numbers _a priori_. I'd rather rely on the commit-graph being in
a good state, not too far behind the refs. An added complexity of computing
generation numbers dynamically is that we would need to add a dependence on
the commit-graph file's existence at all.
If you could make the reachability assumption, I think this question
just goes away. As soon as you hit a commit with _any_ generation
number, you could quit traversing down that path.
That is the idea. I should make this clearer in all of my commit messages.

Thanks,
-Stolee




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux