Re: [PATCH 8/9] fetch: retry fetching submodules if needed objects were not fetched

Stefan Beller <sbeller@xxxxxxxxxx> · Tue, 23 Oct 2018 15:37:48 -0700

On Wed, Oct 17, 2018 at 5:40 PM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
>
> > Currently when git-fetch is asked to recurse into submodules, it dispatches
> > a plain "git-fetch -C <submodule-dir>" (with some submodule related options
> > such as prefix and recusing strategy, but) without any information of the
> > remote or the tip that should be fetched.
> >
> > This works surprisingly well in some workflows (such as using submodules
> > as a third party library), while not so well in other scenarios, such
> > as in a Gerrit topic-based workflow, that can tie together changes
> > (potentially across repositories) on the server side. One of the parts
> > of such a Gerrit workflow is to download a change when wanting to examine
> > it, and you'd want to have its submodule changes that are in the same
> > topic downloaded as well. However these submodule changes reside in their
> > own repository in their own ref (refs/changes/<int>).
>
> It seems that the pertinent situation is when, in the superproject, you
> fetch a commit (which may be the target of a ref, or an ancestor of the
> target of a ref) that points to a submodule commit that was not fetched
> by the default refspec-less fetch that you describe in the first
> paragraph. I would just describe it as follows:
>
>   But this default fetch is not sufficient, as a newly fetched commit in
>   the superproject could point to a commit in the submodule that is not
>   in the default refspec. This is common in workflows like Gerrit's.
>   When fetching a Gerrit change under review (from refs/changes/??), the
>   commits in that change likely point to submodule commits that have not
>   been merged to a branch yet.

Makes sense.

> Another thing you need to clarify is what happens if the fetch-by-commit
> fails. Right now, it seems that it will make the whole thing fail, which
> might be a surprising change in behavior.

But a positive surprise, I would assume?

> The test stores the result in a normal branch, not a remote tracking
> branch. Is storing in a normal branch required?

In the test we fetch from another repository, such that in the
repository-under-test this will show up in a remote tracking branch?

> Also, do you know why this is required? A naive reading of the patch
> leads me to believe that this should work even if merely fetching to
> FETCH_HEAD.

See the next patch, check_for_new_submodule_commits() is missing
for FETCH_HEAD.

>
> > +struct get_next_submodule_task {
> > +     struct repository *repo;
> > +     const struct submodule *sub;
> > +     unsigned free_sub : 1; /* Do we need to free the submodule? */
> > +     struct oid_array *commits;
> > +};
>
> Firstly, I don't think "commits" needs to be a pointer.
>
> Document at least "commits". As far as I can tell, if NULL (or empty if
> you take my suggestion), this means that this task is the first pass for
> this particular submodule and the fetch needs to be done using the
> default refspec. Otherwise, this task is the second pass for this
> particular submodule and the fetch needs to be done passing the
> contained OIDs.

Makes sense. I think I'll make it a non-pointer, but will introduce another
flag or counter for the phase.

>
> > +static const struct submodule *get_default_submodule(const char *path)
> > +{
> > +     struct submodule *ret = NULL;
> > +     const char *name = default_name_or_path(path);
> > +
> > +     if (!name)
> > +             return NULL;
> > +
> > +     ret = xmalloc(sizeof(*ret));
> > +     memset(ret, 0, sizeof(*ret));
> > +     ret->path = name;
> > +     ret->name = name;
> > +
> > +     return (const struct submodule *) ret;
> > +}
>
> What is a "default" submodule and why would you need one?

s/default/artificial/. Such a submodule is a submodule that has no
config in the .gitmodules file and its name == path.
We need to keep those around for historic reasons AFAICT, c.f.
c68f837576 (implement fetching of moved submodules, 2017-10-16)

> > +             task = get_next_submodule_task_create(spf->r, ce->name);
> > +
> > +             if (!task->sub) {
> > +                     free(task);
> > +                     continue;
> >               }
>
> Will task->sub ever be NULL?

Yes, if we fall back to these "default" submodule and just try if it
can be handled
as a submodule, but it cannot be handled as such,
get_next_submodule_task_create has

    task->sub = submodule_from_path(r, &null_oid, path);
    if (!task->sub) {
        task->sub = get_default_submodule(path);

and get_default_submodule can return NULL.

>
> > +     if (spf->retry_nr) {
> > +             struct get_next_submodule_task *task = spf->retry[spf->retry_nr - 1];
> > +             struct strbuf submodule_prefix = STRBUF_INIT;
> > +             spf->retry_nr--;
> > +
> > +             strbuf_addf(&submodule_prefix, "%s%s/", spf->prefix, task->sub->path);
> > +
> > +             child_process_init(cp);
> > +             prepare_submodule_repo_env_in_gitdir(&cp->env_array);
> > +             cp->git_cmd = 1;
> > +             cp->dir = task->repo->gitdir;
> > +
> > +             argv_array_init(&cp->args);
> > +             argv_array_pushv(&cp->args, spf->args.argv);
> > +             argv_array_push(&cp->args, "on-demand");
>
> This means that we need to trust that the last entry in spf->args can
> take an "on-demand" parameter. Could we supply that argument here
> explicitly instead?
>
> > +             argv_array_push(&cp->args, "--submodule-prefix");
> > +             argv_array_push(&cp->args, submodule_prefix.buf);
> > +
> > +             /* NEEDSWORK: have get_default_remote from s--h */
>
> What is s--h?

builtin/submodule--helper, will clarify

>
> > +static int commit_exists_in_sub(const struct object_id *oid, void *data)
> > +{
> > +     struct repository *subrepo = data;
> > +
> > +     enum object_type type = oid_object_info(subrepo, oid, NULL);
> > +
> > +     return type != OBJ_COMMIT;
> > +}
>
> Shouldn't the function name be commit_missing_in_sub?

yes.

>
> >  static int fetch_finish(int retvalue, struct strbuf *err,
> >                       void *cb, void *task_cb)
> >  {
> >       struct submodule_parallel_fetch *spf = cb;
> > +     struct get_next_submodule_task *task = task_cb;
> > +     const struct submodule *sub;
> > +
> > +     struct string_list_item *it;
> > +     struct oid_array *commits;
> >
> >       if (retvalue)
> >               spf->result = 1;
> >
> > +     if (!task)
> > +             return 0;
>
> When will task be NULL?
>
> > +
> > +     sub = task->sub;
> > +     if (!sub)
> > +             goto out;
>
> Same as above - when will sub be NULL?
>
> > +     it = string_list_lookup(&spf->changed_submodule_names, sub->name);
> > +     if (!it)
> > +             goto out;
>
> And "it" as well.

I'll add comments.

>
> > +     commits = it->util;
> > +     oid_array_filter(commits,
> > +                      commit_exists_in_sub,
> > +                      task->repo);
> > +
> > +     /* Are there commits that do not exist? */
> > +     if (commits->nr) {
> > +             /* We already tried fetching them, do not try again. */
> > +             if (task->commits)
> > +                     return 0;
>
> Clearer and more efficient if the check for task->commits was first
> before all the filtering.
>
> > +test_expect_success "fetch new commits on-demand when they are not reachable" '
>
> "not reachable" confused me - they are indeed reachable, just not from
> the default refspec.

Makes sense

>
> > +     git checkout --detach &&
> > +     C=$(git -C submodule commit-tree -m "new change outside refs/heads" HEAD^{tree}) &&
> > +     git -C submodule update-ref refs/changes/1 $C &&
> > +     git update-index --cacheinfo 160000 $C submodule &&
> > +     git commit -m "updated submodule outside of refs/heads" &&
> > +     D=$(git rev-parse HEAD) &&
> > +     git update-ref refs/changes/2 $D &&
> > +     (
> > +             cd downstream &&
> > +             git fetch --recurse-submodules --recurse-submodules-default on-demand origin refs/changes/2:refs/heads/my_branch &&
> > +             git -C submodule cat-file -t $C &&
> > +             git checkout --recurse-submodules FETCH_HEAD
> > +     )
> > +'
>
> For maximum test coverage, I think this test should involve 2
> submodules:
>
>  B   C  E   F
>   \ /    \ /
>    A      D
>
> and the upstream superproject having:
>
>   G -> H -> I
>
> The downstream superproject already has G, and is fetching I. In H, the
> submodule gitlinks point to B and E respectively, and in I, the
> submodule gitlinks point to C and F respectively. This ensures that both
> multiple submodules work, and that submodule commits are also fetched
> for interior superproject commits.

ok.