Re: [PATCH 0/8] fetch --recurse-submodules: fetch unpopulated submodules

Glen Choo <chooglen@xxxxxxxxxx> · Fri, 11 Feb 2022 10:39:56 +0800

Junio C Hamano <gitster@xxxxxxxxx> writes:

>>> The real question is not "in which submodules we fetch", but "what
>>> commits we fetch in these submodules".  I do not think there is a
>>> good answer to the latter.
>>>
>>> Of course, we we take this sequence instead:
>>>
>>> 	git checkout branch-with-submodules
>>> 	git fetch --recurse-submodules
>>> 	git checkout --recurse-submodules branch-with-submodules
>>> 	
>>> things should work correctly (I think we both are assuming that the
>>> other side allows to fetch _any_ object, not just ref), as "fetch"
>>> knows what superproject commit it is asked to complete, unlike the
>>> previous example you gave, where it does not have a clue on what
>>> superproject commit it is preparing submodules for, right?
>>
>> So, given my prior description of recursive fetch, we actually _do_ know
>> which superproject commits to prepare for and which submodule commits to
>> fetch.
>
> Just to make sure I understand what is going on, let me rephrase.
>
>  * To find out which submodule commits we need to fetch, we find new
>    commits in the superproject we just fetched, inspect the trees of
>    these commits to see gitlinks that name commits we need to fetch
>    into the submodule repositories.
>
>  * For that to work well, we need to know, from the path these
>    commits appear in the trees of the superproject, to find out from
>    which submodule to fetch these commits from.  And to make the
>    mapping from paths to submodule names, we need to read
>    .gitmodules from the same superproject commit we found the
>    submodule commit in (as during the history of the superproject,
>    the submodule may have moved around).
>
> If so, I understand why being able to read .gitmodules from
> superproject commits is essential.  The flow would become like
>
>  (1) fetch in the superproject
>
>  (2) iterate over each new superproject commit:
>      - read its .gitmodules
>      - iterate over each gitlink found in the superproject commit:
>        - map the path we found gitlink at into module name
>        - find the submodule repository initialized for the module
>          - if the submodule is not of local interest, skip
>          - add the submodule commit pointed by gitlink to the
>            set of commits that need to be fetched for the submodule [*]
>
>  (3) iterate over each submodule we found more than one commits that
>      need to be fetched in, and fetch these commits (we do not have
>      to go over the network to re-fetch commits that exist in the
>      object store and are reachable from the refs, but "fetch"
>      already knows how to optimize that).
>
> Am I on the right track?

Yup, I think that's quite an accurate description. In particular..

>  (2) iterate over each new superproject commit:
>      - read its .gitmodules

Prior to this series, .gitmodules is read from the filesystem even
though we may notice the missing commit in a non-checked-out
superproject commit.

>  (3) iterate over each submodule we found more than one commits that
>      need to be fetched in, and fetch these commits

Yes, this describes the new "fetch changed submodules behavior"
accurately. However, we also attempt to fetch checked out submodules,
and this is where the two fetching strategies, "yes" and "on-demand" [1]
matter:

"yes", aka "--recurse-submodules" tells "git fetch" to attempt to fetch
_every_ checked out submodule regardless of whether Git thinks it has
missing commits (if we do not find any missing commits, I believe it
defaults to the "git fetch <remote-name>" form). [2]

"on-demand", aka "--recurse-submodules=on-demand", is the 'default'
option. (It is 'default' as in the sense of not passing a
"--recurse-submodules" arg, not default as in passing
"--recurse-submodules" without an "="). With "on-demand", we _only_
attempt to fetch changed submodules. Vis-a-vis "yes", this strategy has
no effect on fetching non-checked-out submodules because we only
fetch changed, non-checked-out submodules anyway, but it lets us ignore
unchanged, checked out submodules.

[1] From Documentation/fetch-options.txt:

--recurse-submodules[=yes|on-demand|no]::
  This option controls if and under what conditions new commits of
  populated submodules should be fetched too. It can be used as a
  boolean option to completely disable recursion when set to 'no' or to
  unconditionally recurse into all populated submodules when set to
  'yes', which is the default when this option is used without any
  value. Use 'on-demand' to only recurse into a populated submodule
  when the superproject retrieves a commit that updates the submodule's
  reference to a commit that isn't already in the local submodule
  clone. By default, 'on-demand' is used, unless
`fetch.recurseSubmodules` is set (see linkgit:git-config[1]).

[2] Sidenote on the "yes" option: I think the rationale for doing
unconditional fetches is not clear to reviewers. Admittedly, beyond 'the
test suite fails', I don't really remember why either. I'll dig into
this as I respond to the reviewer feedback.