Re: [PATCH v4 09/10] fetch: fetch unpopulated, changed submodules

Glen Choo <chooglen@xxxxxxxxxx> · Fri, 04 Mar 2022 14:59:18 -0800

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Glen Choo <chooglen@xxxxxxxxxx> writes:
>
>>  		item = string_list_insert(changed, name);
>> -		if (!item->util)
>> +		if (item->util)
>> +			cs_data = item->util;
>> +		else {
>>  			item->util = xcalloc(1, sizeof(struct changed_submodule_data));
>> -		cs_data = item->util;
>> +			cs_data = item->util;
>> +			cs_data->super_oid = commit_oid;
>> +			cs_data->path = xstrdup(p->two->path);
>> +		}
>
> I do not quite get this change.
>
> collect_changed_submodules() walks a range of revisions in the
> superproject, doing an equivalent of "git log --raw" and feeding the
> differences to this callback function.  The above code looks at the
> path and uses the "changed" string list to record which submodule
> was modified, what commit in the submodule is needed, etc.
>
> What happens when the range has more than one change to the same
> submodule?  cs_data has only one room for recording .super_oid
> (which commit in the superproject touches the submodule) and .path
> (where in the superproject's tree the submodule exists).  "git mv"
> of a submodule might be rare and it may not hurt too much that only
> a single .path can be kept, but it looks somewhat iffy.

Yes, I agree that it looks odd, which is why I added this comment to
hopefully make it less opaque:

  + * (super_oid, path) allows the submodule config to be read from _some_
  + * .gitmodules file. We store this information the first time we find a
  + * superproject commit that points to the submodule, but this is
  + * arbitrary - we can choose any (super_oid, path) that matches the
  + * submodule's name.

I guess this only says that it is ok to store .super_oid and .path from
any commit, but doesn't go in depth into _why_. It's ok because we only
need (.super_oid, .path) because repo_submodule_init(..., path,
treeish_name) maps these args to the submodule's name and gitdir (i.e.
.git/modules/<name>).

This means we don't worry about 'git mv' (super_oid's .gitmodules will
tell us the correct name even if the path changed relative to some other
commit), nor seeing the submodule more than once (it doesn't matter
whose .gitmodules we look at so long as repo_submodule_init() derives
the correct gitdir).

And now that you've pointed this out, I realize that we could do away
with (.super_oid, .path) altogether if we had a variant of
repo_submodule_init() that takes the submodule name instead of (path,
treeish_name). (We have a similar submodule_from_name(), but that only
reads the submodule config, not a struct repository.) I would prefer not
to introduce such a function so late into the review cycle, but I could
clean this up later.

>>  		oid_array_append(&cs_data->new_commits, &p->two->oid);
>
> At least, we are not losing any submodule commit even when the same
> submodule is touched more than once by the superproject, but it is
> dubious why we have cs_data.super_oid and cs_data.path in the first
> place.

On the hand, we actually need to record every submodule commit, so yes.

> How are they used, or are they something that seemed useful when the
> code was first written but it turned out that they weren't and left
> unused?
>
> Or do we need to make cs_data an array of 3-tuple { .super_oid,
> .submodule_oid, .path } for each submodule name?

To conclude:

- The changed_submodules string_list is basically a map that tells us,
  for a given submodule _name_, which commits we need to fetch and where
  repo_submodule_init() can read the submodule name from.
- We only use cs_data as a string_list_item.util, and the
  string_list_item.string is the submodule name itself.
- .new_commits tells us which commits to fetch.
- .super_oid and .path tells repo_submodule_init() how to get the name
  of the submodule.

So we don't need to make this a 3-tuple.