Re: [PATCH] fetch: avoid quadratic loop checking for updated submodules

Jeff King <peff@xxxxxxxx> · Tue, 13 Sep 2011 18:17:45 -0400

On Tue, Sep 13, 2011 at 09:13:10PM +0200, Jens Lehmann wrote:

> The real world use case I have for that is that when a bug introduced by
> a new submodule commit is detected later on, the superproject's fix
> records an older submodule commit to remove the problematic change from
> the superproject. But the submodule's branch isn't rewound (as that is
> most probably master) but a fix is applied on top of it. Then that one
> gets tested and - if it was found ok - recorded in the superproject.

OK, that makes sense. It is a "rewind" from the perspective of the
superproject, but there is never a fork; because the submodule didn't
rewind, when we do get a new submodule state in the superproject, it
will be a fast forward from the old state.

That is a reasonable use case.

It's still probably a minority case, and we have to pay for the full
traversal each time, which is annoying. But now that it's turned off by
default if you don't have any submodules, I'm less concerned about the
performance impact. And superproject repositories are probably not going
to have the same number of commits as submodule repositories, so it may
be less of an issue.

One thing that could make it slightly less expensive (but I wouldn't
worry about implementing until it becomes an issue): you do a full diff
and collect any changed GITLINK entries, and then compare the paths we
get with the submodule config. Couldn't you instead do something like
this:

  - let S = set of submodule paths that we care about, from the config
  - start the "git rev-list $new --not $old" traversal, as we do now
  - for each commit
    - diff using a pathspec of S
    - for each changed entry
      - add it to the set of changed submodules
      - remove it from S
    - if S is empty, break

That has two advantages:

  1. We limit the diff by pathspec, which means we can avoid looking at
     irrelevant parts of the tree (we don't do this yet, but hopefully
     we will in the future).

  2. You can break out of the traversal early if you already know you
     have changes in each submodule of interest.

> Changes like this could be overlooked if you only compare "before" and
> "after" instead of traversing, leading to not fetching a submodule which
> does have new commits that are used in the newly fetched superproject's
> commits. I'd like to have on-demand fetch keep the correct solution of
> traversing unless we have good reasons against it, especially as teaching
> checkout & friends to recursively update submodules too depends on all
> needed commits being present.

Out of curiosity, what happens if we don't have such a commit? I know
you said that your policy is never to rewind a submodule's commit that
has been published in a superproject, and I think that's the only sane
thing to do. But it's not enforced by git itself, and I wonder how badly
we break if it does happen (i.e., I'm hoping the answer is "you can't
diff or checkout superproject revisions that reference the missing bit"
and not "git log crashes when it gets to that point in history").

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html