Re: [RFC] t7410: 210 tests for various 'git submodule update' scenarios

Johan Herland <johan@xxxxxxxxxxx> · Thu, 17 Apr 2014 13:42:42 +0200

TL;DR: I believe the update-related submodule.<name>.* options
(.branch, .update, .url, and even the --remote option) are
slowly growing into a separate DSL for specfying how to update
submodules. This adds a confusing and opaque layer in front of
what is essentially git commands run by "git submodule update"
within the submodule. Instead of adding more submodule.<name>.*
options to handle even more use cases, we should scale back on
the config options, and allow the user to specify what should
happen on "git submodule update" in the existing language of
git commands.

On Wed, Apr 16, 2014 at 7:21 PM, W. Trevor King <wking@xxxxxxxxxx> wrote:
> On Wed, Apr 16, 2014 at 02:54:48AM +0200, Johan Herland wrote:
>> This is a work-in-progress to flesh out (and promote discussion about)
>> the expected behaviors for all possible scenarios in which
>> 'git submodule update' might be run.
>
> This is lovely :).

Thanks!

Having been active in git-submodule development a few years ago,
I thought I had a fairly firm grasp of most details regarding
submodules, but when I recently tried to explain to someone how
"git submodule update" was supposed to work, I quickly found that
I had forgotten a lot, and that git-submodule had also changed.
After (albeit casually) reading the docs, I still didn't feel much
wiser, and this humongous testcase is sort of my ultimate attempt
to regain some understanding.

So far, my impression is that there are too many details/variables
and too much state to be able to sanely comprehend what is the intended
behavior in all cases, much less document in a way that is approachable
to regular users...

>> +#  - current state of submodule:
>> +#     ?.?.?.1 - not yet cloned
>> +#     ?.?.?.2 - cloned, detached, HEAD == gitlink
>> +#     ?.?.?.3 - cloned, detached, HEAD != gitlink
>> +#     ?.?.?.4 - cloned, on branch foo (exists upstream), HEAD == gitlink
>> +#     ?.?.?.5 - cloned, on branch foo (exists upstream), HEAD != gitlink
>> +#     ?.?.?.6 - cloned, on branch bar (MISSING upstream), HEAD == gitlink
>> +#     ?.?.?.7 - cloned, on branch bar (MISSING upstream), HEAD != gitlink
>
> The remote branches should only matter for the initial clone and
> --remote updates.  Also, only the configured submodule.<name>.branch
> (your first axis) should be checked; the locally checked-out submodule
> branch shouldn't matter.

Yes, I realized that while working out the various cases, but it was
not at all obvious from the start, given that the config option is
called submodule.<name>.branch and not submodule.<name>.upstream or
similar (which might prevent a casual user from misinterpreting the
option as having something to do with the current local branch in the
submodule). See D6 for more discussion about the naming of .branch.

In general, I experimented with a few different ways of organizing the
axes to get a "flow" of testcases that more naturally "revealed" the
behavior of "git submodule update", but I haven't found a good
organization, yet. I suspect there is none...

>> +# T2: Test with submodule.<name>.url != submodule's remote.origin.url. Does
>> +#     "submodule update --remote" sync with submodule.<name>.url, or with the
>> +#     submodule's origin? (or with the submodule's current branch's upstream)?
>
> All fetches should currently use the submodule's remote.origin.url.
> submodule.<name>.url is only used for the initial clone (*.*.*.1), and
> never referenced again.  This would change using my tightly-bound
> submodule proposal [1], where a difference between
> submodule.<name>.url and the submodule's @{upstream} URL would be
> trigger a dirty-tree condition (for folks with tight-bind syncing
> enabled) from which you couldn't update before resolving the
> difference.

Ok. As stated above, I am worried about the amount of duplicated
state between the superproject's submodule config (which itself is
split between .gitmodules and .git/config) and the submodule's own
config. And from the above paragraph, I suspect two more dimensions
need to be added to the test matrix:

 - submodule's remote.origin.url ==/!= submodule.<name>.url

 - "tightly-bound submodule" is enabled/disabled

Even if we might successfully deal with the duplicate state (and the
exploding test matrix), it will certainly not be easy to
describe/document the intended behavior in a way that's simple to
grasp and straightforward to use.

We should instead seek ways to minimize the duplication of state.
If it is indeed the case that several submodule.<name>.* values in the
superproject are only consulted _once_ (when cloning the submodule),
we should make it obvious that they will not be used after the clone
is done. Similarly, we should make it obvious that the update-related
options only apply exactly when "git submodule update" is run.

What about something like this:

 - submodule.<name>.create: The command that will be used by
   "git submodule update" to initially populate the contents of
   the submodule. The command is run from within the submodule
   directory, and the following environment variables are available:

    - $SUPER_URL: The URL of the superproject's current upstream
      remote.

    - $SUPER_HEAD: The current value of the superproject's HEAD.
      Typically a refname, but may be commit id (if detached).

    - $GITLINK: The current value of the superproject's gitlink
      for this submodule.

   Example values for submodule.<name>.create:

    - 'git clone -n $SUPER_URL/../sub.git . && git reset --hard $GITLINK'
      This clones from a relative submodule URL (by using $SUPER_URL),
      and then does a detached checkout of $GITLINK. This would be
      equivalent to the current "checkout-mode".

    - 'git clone -n git://url.to/sub.git . && git checkout dev-branch'
      This clones from an absolute submodule URL, and then (assuming
      "dev-branch" exists upstream, which triggers a checkout's DWIM
      behavior) creates a local "dev-branch" with origin/dev-branch
      as its upstream. This would be a suitable start point for
      developing on "dev-branch" in the submodule.

    - 'git clone -n -c branch.autosetuprebase=remote git://url.to/sub.git . \
       && git checkout dev-branch'
      Same as above, but prepares for a rebase-style workflow instead.

 - submodule.<name>.update: The command that will be run within the
   already-existing submodule by "git submodule update". The same
   enviroment variables as above are available. Example values for
   submodule.<name>.update:

    - 'git reset --hard $GITLINK'
      Equivalent to checkout-mode (without --remote).

    - 'git fetch && git reset --hard origin/HEAD'
      Equivalent to checkout-mode with --remote.

    - 'git merge $GITLINK'
      Equivalent to merge-mode (without --remote).

    - 'git pull origin next'
      Equivalent to merge-mode with --remote and .branch == next.

    - 'git rebase $GITLINK'
      Equivalent to rebase-mode (without --remote)

    - 'git pull --rebase origin pu'
      Equivalent to rebase-mode with --remote and .branch == pu.

    - 'git pull'
      Defer completely to the submodule's own config.

Pros/Cons:

 Pro: It is obvious to everybody exactly what will happen when
      "git submodule update" is run.

 Pro: We no longer duplicate state with the submodule's config.
      Instead, it is obvious when and how these options modify
      the submodule, and whether or not they operate independently
      of the submodule's own config.

 Pro: We reuse the current "language" (of git commands and
      command-line options) for describing how to perform updates,
      instead of having to invent a duplicate (but terser) language
      in the form of submodule.<name>.* options.

 Pro: With just a couple of options, we can obsolete most of the
      existing submodule.<name>.* options:

       - .branch == $foo is replaced by "git pull origin $foo" in
         the new .update command.

       - .update == none/checkout/merge/rebase is replaced by an
         appropriate git command in the new .update command.

       - .url becomes a 'clone' argument in the .create command.

       - The --remote option is replaced by updating against remote
         .branch instead of $GITLINK

 Pro: Allows for ultimate flexibility in handling submodule updates
      without increasing the pressure to keep adding
      submodule.<name>.* options.

 Con: Wildly backwards-incompatible with existing submodule.* configs.

 Con: Probably missing proper error handling and/or other handling of
      corner cases in the above example commands.

 Con: A user who has copied/modified these options into his
      superproject's .git/config will manually have to reconcile that
      with any future updates to .gitmodules.

 Con: Too much information in each config option? Maybe consider using
      hooks instead of config options?

>> +# D1: When submodule is already at right commit, checkout-mode currently does
>> +#     nothing. Should it instead detach, even when no update is needed?
>> +#     Affects: 1.2.1.4, 1.2.1.6, 2.2.1.4, 2.2.1.6, 3.2.1.4, 3.2.1.6
>
> “Checkout updates always leave a detached HEAD” seems easier to
> explain, so I'm leaning that way.

Yes, although I suspect different people using different workflows will
have different (but valid) opinions on how this should be handled.
Which is why I'm approaching the conclusion outlined in the TL;DR; i.e.
that adding more submodule.<name>.* options (which often interacts
with other options in complex ways) is probably NOT the way to go.

Instead, we should recognize that people may want to have their
submodules updated in so many different ways, that trying to encode it
in a collection of submodule.<name>.* options is pointless. We can then
provide something more flexible that reuses existing syntax, like the
"free-form" options suggested above, or maybe a "submodule-update" hook
that allows even more customization.

Obviously, we should still have good defaults to deal with the most
common cases, but the current set of options is growing too large to be
understandable by most users.

>> +# D2: Should all/some of 1.3.*/1.4.* abort/error because we don't know what to
>> +#     merge/rebase with (because .branch is unset)? Or is the current default
>> +#     to origin/HEAD OK?
>> +#     Affects: 1.3.*, 1.4.*
>
> Maybe you mean 1.3.*, 1.4.*, and 1.5.* (merge, rebase, and !command)?
> In all of these cases, we're integrating the current HEAD with the
> gitlinked (*.*.1.*) or remote-tracking branch (*.*.2.*).  Since
> submodule.<name>.branch defaults to master (and may be changed to HEAD
> after a long transition period? [2,3]), I don't think we should
> abort/error in those cases.

Yes, it seems we're settling on origin/HEAD as an acceptable default.

>> +# D3: When submodule is already at right commit, merge/rebase-mode currently
>> +#     does nothing. Should it do something else (e.g. not leave submodule
>> +#     detached, or checked out on the "wrong" branch (i.e. != .branch))?
>> +#     (This discussion point is related to D1, D5 and D6)
>
> “Non-checkout updates always leave you on a branch” seems easier to
> explain, but I think we'd want to distinguish between the local branch
> and the remote submodule.<name>.branch [1].  Lacking that distinction,
> I'd prefer to leave the checked-out branch unchanged.

Yes, although again I feel that we are making policy decision based on
what we _believe_ would be ok in our use cases. Adding a .local-branch
config option to specify what happens to the local branch would maybe
fix things for other use cases, but it would also compound the overall
complexity of "submodule update". See above arguments for replacing
config options with explicit git commands, to make "submodule update"
more transparent.

>> +# D4: When 'submodule update' performs a clone to populate a submodule, it
>> +#     currently also creates a default branch (named after origin/HEAD) in
>> +#     that submodule, EVEN WHEN THAT BRANCH WILL NEVER BE USED (e.g. because
>> +#     we're in checkout-mode and submodule will always be detached). Is this
>> +#     good, or should the clone performed by 'submodule update' skip the
>> +#     automatic local branch creation?
>> +#     Affects: 1.2.*.1, 1.3.*.1, 1.4.*.1, 1.5.*.1,
>> +#              2.2.*.1, 2.3.*.1, 2.4.*.1, 2.5.*.1,
>> +#              3.2.1.1, 3.3.1.1, 3.4.1.1, 3.5.1.1
>
> “Checkout updates always leave a detached HEAD” seems easier to
> explain, so I'm leaning that way.

This is not about detached HEAD. This is about the "extra" local branch
automatically created by "git clone". This branch is unused and
unnecessary in checkout-mode, and only coincidentally useful in other
modes (when .branch == origin/HEAD).

However, this is probably a git-clone problem, and not a git-submodule
problem: I am surprised that "git clone --no-checkout" produces the
same local branch (without checking it out). I would instead have
expected the clone to only have the refs/remote/origin/* refs (and
HEAD pointing to an unborn branch - like "git init" does).

>> +# D5: When in merge/rebase-mode, and 'submodule update' actually ends up doing
>> +#     a merge/rebase, that will happen on the current branch (or detached HEAD)
>> +#     and NOT on the configured (or default) .branch. Is this OK? Should we
>> +#     abort (or at least warn) instead? (In general, .branch seems only to
>> +#     affect the submodule's HEAD when the submodule is first cloned.)
>> +#     (This discussion point is related to D3 and D6)
>> +#     Affects: 1.3.1.3, 1.3.1.5, 1.3.1.7, 1.3.2.>=2,
>> +#              1.4.1.3, 1.4.1.5, 1.4.1.7, 1.4.2.>=2,
>> +#              2.3.1.3, 2.3.1.5, 2.3.1.7, 2.3.2.2, 2.3.2.4, 2.3.2.6,
>> +#              2.4.1.3, 2.4.1.5, 2.4.1.7, 2.4.2.2, 2.4.2.4, 2.4.2.6
>> +#              3.3.1.3, 3.3.1.5, 3.3.1.7
>> +#              3.4.1.3, 3.4.1.5, 3.4.1.7
>
> With the --remote option that added submodule.<name>.branch (which
> eventually landed with v8 of that series [4]), I initially imagined it
> as the name of the local branch [5], but transitioned to imagining it
> as the name of the remote-tracking branch in v5 of that series [6].
> There were no major logical changes between v5 and v8.  With the v8
> version that landed in Git v1.8.2, submodule.<name>.branch was clearly
> the name of the remote-tracking branch, and we gave no way to
> separately configure the local branch.

First of all, shame on me for not following the discussion at the time.
I still think that when the meaning changed from local to remote-
tracking branch, the actual option name should have changed as well.
Documenting that it specifies a remote branch is good, but not as good
as choosing a better name for it in the first place.

> Recently, we decided that local branches might be important after all
> [7], which lead to the partially landed v5 of my local-branch-creation
> series [8], now partially reverted with d851ffb (Revert "submodule:
> explicit local branch creation in module_clone", 2014-04-02).  Like v4
> of that series [9], I considered the landed-and-now-reverted v5 to be
> a stop-gap until we got better local-branch handling.
>
> Anyhow, that's why submodule.<name>.branch is only important when we
> interact with the remote repository (during clones and --remote
> updates).  We've never landed a patch that explicitly addresses what
> the local branch should be.

Thanks for the recap. I now realize that my above arguments against
increased complexity in submodule.<name>.* options arrive way too
late, and is probably more like trolling than like constructive input.
I am sorry for having been absent during most of this discussion.
Still, I'm afraid that the current set of options are so complex that
they will only ever be fully understood by the very small set of people
involved with their development.

>> +# D6: The meaning of submodule.<name>.branch is initially confusing, as it does
>> +#     not really concern the submodule's local branch (except as a naming hint
>> +#     when the submodule is first cloned). Instead, submodule.<name>.branch is
>> +#     really about which branch in the _upstream_ submodule
>
> Which is how gitmodules(5) explains it:
>
>   submodule.<name>.branch
>     A remote branch name for tracking updates…

Good, but I fear gitmodules(5) is too hidden for the regular user.
It'd be better to mention this in git-submodule(1), as I expect
gitmodules(5) is largely read by .gitmodules _authors_, and not
regular users. Obviously, the real fix would be a better name...

>> +#     submodule.<name>.url, or by the submodule's remote.origin.url?)
>> +#     want to integrate with.
>
> The submodule's remote.origin.url for everything except the initial
> clone (*.*.*.1).  See my response to T2.

As mentioned above, submodule.<name>.url is then an unnecessary state
duplication. We should make it more obvious that it is only ever used
on the initial clone (see my above argument for moving .url into an
explicit git-clone command)

>> …                               This is probably the more useful setting, and it
>> +#     becomes obviously correct after (re-)reading gitmodules(5) and
>> +#     git-config(1). However, from just reading the "update" section in
>> +#     git-submodule(1) (or not even that), things are not so clear-cut. Would
>> +#     submodule.<name>.upstream (or .remote-branch, or similar) be a better
>> +#     name for this?
>
> Are the docs from 23d25e4 (submodule: explicit local branch creation
> in module_clone, 2014-01-26; now reverted with d851ffb, Revert
> "submodule: explicit local branch creation in module_clone",
> 2014-04-02) clearer?  Maybe we can salvage some of those docs even
> though we've reverted the actual code changes?

As I said above, better docs is good, but no replacement for a better
named option.

>> +# D7: What to do when .branch refers to a branch that is missing from upstream?
>> +#     Currently, when trying to clone, the clone fails (which causes 'git
>> +#     submodule update --remote' to fail), but leaves the submodule in an
>> +#     uninitialized state (there is a .git, but the work tree is missing).
>> +#     This is probably not the behavior we want...
>> +#     Affects: pre, 3.2.2.1, 3.3.2.1, 3.4.2.1, 3.5.2.1
>
> I think we should remove the submodule's .git file after the failed clone.

Agreed, but does that extend to the superproject's .git/modules/<name> too?

Again, sorry for the length and lateness of this discussion. I still
hope it may be of some use, though.

...Johan

--
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html