Tight submodule bindings (was: Preferred local submodule branches)

"W. Trevor King" <wking@xxxxxxxxxx> · Sat, 11 Jan 2014 17:08:47 -0800

On Wed, Jan 08, 2014 at 10:17:51PM -0800, W. Trevor King wrote:
> In another branch of the submodule thread Francesco kicked off, I
> mentioned that we could store the preferred local submodule branch on
> a per-superbranch level if we used the
> .git/modules/<submodule-name>/config for local overrides [1].  Here's
> a patch series that greatly extends my v2 "submodule: Respect
> requested branch on all clones" series [2] to also support automatic,
> recursive submodule checkouts, as I outlined here [3].
> 
> [1]: http://article.gmane.org/gmane.comp.version-control.git/240240
> [2]: http://article.gmane.org/gmane.comp.version-control.git/239967
> [3]: http://article.gmane.org/gmane.comp.version-control.git/240192

While mulling over better ways to explain my local-branch idea, I've
come up with a more tightly bound model that may help break the
silence that has greeted the “Preferred local submodule branches”
series ;).  That series doesn't have strong opinions on update
mechanics, which leads to wishy-washy exchanges where nobody has a
clear mental picture:

On Thu, Jan 09, 2014 at 10:40:52PM +0100, Jens Lehmann wrote:
> Am 09.01.2014 20:55, schrieb W. Trevor King:
> > On Thu, Jan 09, 2014 at 08:23:07PM +0100, Jens Lehmann wrote:
> >> Am 09.01.2014 18:32, schrieb W. Trevor King:
> >>>> when superproject branches are merged (with and without conflicts),
> >>>
> >>> I don't think this currently does anything to the submodule itself,
> >>> and that makes sense to me (use 'submodule update' or my 'submodule
> >>> checkout' if you want such effects).  We should keep the current logic
> >>> for updating the gitlinked $sha.  In the case that the
> >>> .gitmodule-configured local-branches disagree, we should give the
> >>> usual conflict warning (and <<<===>>> markup) and let the user resolve
> >>> the conflict in the usual way.
> >>
> >> For me it makes lots of sense that in recursive checkout mode the
> >> merged submodules are already checked out (if possible) right after
> >> a superproject merge, making another "submodule update" unnecessary
> >> (the whole point of recursive update is to make "submodule update"
> >> obsolete, except for "--remote").
> > 
> > If you force the user to have the configured local-branch checked out
> > before a non-checkout operations with checkout side-effects (as we
> > currently do for other kinds of dirty trees), I think you'll avoid
> > most (all?) of the branch-clobbering problems.
> 
> I'm thinking that a local branch works in two directions: It should
> make it easy to follow an upstream branch and also make changes to it
> (and publish those) if necessary. But neither local nor upstream
> changes take precedence, so the user should either use "merge" or
> "rebase" as update strategy or be asked to resolve the conflict
> manually when "checkout" is configured and the branches diverged.
> Does that make sense?

The current series is only weakly bound (you can explicitly call 'git
submodule checkout' to change to the preferred local submodule
branch), and the current Git is extremely weakly bound (you have to cd
into the submodule and change branches by hand).  The following
extrapolates the “Preferred local submodule branches” series to a
tightly-bound ideal.

Gitlinked commit hash
---------------------

The submodule model revolves around links to commits (“gitlinks”):

  $ git ls-tree HEAD
  100644 blob 189fc359d3dc1ed5019b9834b93f0dfb49c5851f    .gitmodules
  160000 commit fbfa124c29362f180026bf0074630e8bd0ff4550  submod

These are effectively switchable trees.  The tree referenced by commit
fbfa124 is 492781c:

  $ (cd submod/ && git cat-file commit fbfa124)
  tree 492781c581d4dec380a61ef5ec69a104de448a74
  …

If you init the submodule, subsequent checkouts will check out that
tree, just like 'git checkout' would do if you'd had a superproject
tree like:

  $ git ls-tree HEAD
  100644 blob 189fc359d3dc1ed5019b9834b93f0dfb49c5851f    .gitmodules
  040000 tree 492781c581d4dec380a61ef5ec69a104de448a74    submod

For folks who treat the submodule as a black box (and do no local
development), switchable trees are all they care about.  They can
easily check out (or not, with deinit) the submodule tree at a
gitlinked hash, and everything is nice and reproducible.  The fact
that 'submod' is stored as a commit object and not a tree is just a
convenient marker for optional init/deinit/remote-update-integration
functionality.
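
The switchable-tree equivalence can be checked with plumbing commands.
What follows is only a minimal sketch, assuming it is run from the
superproject root with an initialized submodule; the helper names
gitlink_hash and gitlink_tree are made up for illustration and are not
part of Git or of any patch series:

```shell
# Print the commit hash gitlinked at path $2 in superproject revision $1.
# Prints nothing if $2 is not a gitlink (mode 160000) in that revision.
gitlink_hash() {
    git ls-tree "$1" "$2" |
    awk '$1 == "160000" && $2 == "commit" { print $3 }'
}

# Print the tree behind that gitlinked commit, looked up in the
# initialized submodule's own object store.
gitlink_tree() (
    hash=$(gitlink_hash "$1" "$2") &&
    test -n "$hash" &&
    git -C "$2" rev-parse --verify --quiet "$hash^{tree}"
)
```

With the example above, 'gitlink_hash HEAD submod' would print the
fbfa124… commit and 'gitlink_tree HEAD submod' the 492781c… tree.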

Additional metadata, the initial checkout, and syncing down
-----------------------------------------------------------

However, folks who do local submodule development will care about
which submodule commit is responsible for that tree, because that's
going to be the base of their local development.  They also care about
additional out-of-tree information, including the branch that commit
is on.  For already-initialized submodules, there are existing places
in the submodule config to store this configuration:

1. HEAD for the checked-out branch,
2. branch.<name>.remote → remote.<name>.url for the upstream
   subproject URL,
3. branch.<name>.merge for the upstream branch,
4. branch.<name>.rebase (or pull.rebase) to prefer rebase over merge
   for integration,
5. …

You need somewhere in-tree to store this destined-to-be-out-of-tree
information, so that superproject developers that have not yet
initialized the submodule will know what values are suggested by the
superproject maintainers.  That's where .gitmodules comes in, because
storing all of this fairly static, locally overridable information in
the gitlink itself would be nonsensical (said Linus in 2007 [1]).
When you checkout a submodule for the first time, Git should take the
default information from .gitmodules and file it away in the
submodule's appropriate out-of-tree config locations.  The out-of-tree
data listed above should be stored in:

1. submodule.<name>.local-branch
2. submodule.<name>.url
3. submodule.<name>.branch
4. submodule.<name>.update
5. …
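
To make the downward flow concrete, here is a rough sketch of copying
.gitmodules defaults into a submodule's own config.  It assumes the
proposed submodule.<name>.local-branch key, a remote called "origin",
and a mapping from submodule.<name>.update onto branch.<name>.rebase;
the function name sync_down is hypothetical.  Moving HEAD itself is
left to the checkout step, so only config values are written here:

```shell
# Copy suggested defaults from the in-tree .gitmodules into the
# submodule's out-of-tree config.  $1 = submodule name, $2 = path.
# Run from the superproject root.
sync_down() (
    name=$1 path=$2

    # Upstream URL -> remote.origin.url (remote name is an assumption).
    url=$(git config -f .gitmodules --get "submodule.$name.url") &&
        git -C "$path" config remote.origin.url "$url"

    # Without a local branch there is no branch.<name>.* section to fill.
    branch=$(git config -f .gitmodules --get "submodule.$name.local-branch") ||
        exit 0

    # Integration mode -> branch.<branch>.rebase.
    mode=$(git config -f .gitmodules --get "submodule.$name.update") || mode=
    case $mode in
    rebase) git -C "$path" config "branch.$branch.rebase" true ;;
    merge)  git -C "$path" config "branch.$branch.rebase" false ;;
    esac
)
```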

Once you have an in-tree way to specify defaults for this out-of-tree
information, you're going to have developers like me that just want to
stick with the defaults, following them through changes.  That means
you'd like to have the “copy .gitmodules defaults into your
submodule's config” functionality that usually happens on the initial
submodule checkout happen on *every superproject-initiated checkout*.
In fact, I think life is easier for everyone if this is the default,
and we add a new option (submodule.<name>.sync = false) that says
“don't overwrite optional settings in my submodule's out-of-tree
config on checkout” for folks who want to opt out.  Don't worry,
this is not going to clobber people, because we'll be syncing the
other way too.

Syncing up
----------

In the previous section I explained how data should flow from
.gitmodules into out-of-tree configs.  What about the other direction?
We currently let folks handle this by hand, but I'd prefer a tighter
integration between the submodule config and the superproject tree to
avoid losing work.  That means changes to tracked submodule status
(checked-out hash, checked-out branch, upstream URL, upstream branch,
default integration strategy, …) should trigger dirty-tree status just
like uncommitted changes to in-tree files.  'git add' (or stash) on
the dirty submodule would store changed commit hashes in the index,
pull changed out-of-tree configs back into the in-tree .gitmodules,
and add the new .gitmodules to the index.  If the working .gitmodules
was already dirty (vs. the index), the add/stash should die without
making any changes.  If the user has disabled syncing between
.gitmodules and the submodule's out-of-tree configs, then don't worry
about optional settings.  Always sync the required settings, which at
this point would just be submodule.<name>.local-branch.
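
The upward direction could look roughly like the following.  This is a
sketch under stated assumptions, not an implementation of 'git add':
sync_up is a hypothetical helper, submodule.<name>.local-branch and
submodule.<name>.sync are the options proposed in this message, and
only the URL is treated as an optional setting to keep things short:

```shell
# Pull a submodule's out-of-tree state back into the superproject index.
# $1 = submodule name, $2 = submodule path.  Run from the superproject root.
sync_up() (
    name=$1 path=$2

    # A .gitmodules that already differs from the index would be
    # ambiguous to rewrite, so refuse without changing anything.
    git diff --quiet -- .gitmodules || {
        echo "error: .gitmodules has unstaged changes" >&2
        exit 1
    }

    # Required setting: the branch checked out in the submodule.
    branch=$(git -C "$path" symbolic-ref --short -q HEAD) || {
        echo "error: $path has a detached HEAD" >&2
        exit 1
    }
    git config -f .gitmodules "submodule.$name.local-branch" "$branch" ||
        exit 1

    # Optional settings, skipped when the proposed sync option is false.
    sync=$(git config --bool --get "submodule.$name.sync") || sync=true
    if test "$sync" = true
    then
        url=$(git -C "$path" config --get remote.origin.url) &&
            git config -f .gitmodules "submodule.$name.url" "$url"
    fi

    # Stage the new gitlink hash together with the updated .gitmodules.
    git add .gitmodules "$path"
)
```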

Purely local metadata
---------------------

Some metadata does not make sense in the superproject tree.  For
example, whether a submodule is interesting enough to checkout
(init/deinit) or whether you want to auto-sync optional metadata with
the .gitmodules defaults.  This metadata should live in the superproject's
out-of-tree config, and should not be stored in the in-tree
.gitmodules.  Since you *will* want to share the upstream URL, I
proposed using an explicit submodule.<name>.active setting to store
the “do I care” information [2], instead of overloading
submodule.<name>.url (I'd auto-sync the .gitmodules
submodule.<name>.url with the subproject's remote.origin.url unless
the user opted out of .gitmodules syncing).

Subsequent checkouts
--------------------

Now that we have strict linking between the submodule state (both
in-tree and out-of-tree configs) and the superproject tree (gitlink
and .gitmodules), changing between superproject branches is really
easy:

1. Make sure the working tree is not dirty.  If it is, ask the user to
   either add-and-commit or stash, and then die to let them do so.

2. Checkout the new superproject branch.

   2.1. For each old submodule that doesn't exist in the new branch,
        blow away the submodule directory (assuming a new-style
        .git/modules/… layout, and not an old-style submod/.git/…
        layout).

   2.2. For each gitlinked submodule that didn't exist in the old
        branch, setup the submodule as if you were doing the initial
        cloning checkout (forcing a new local-branch to point at the
        gitlinked commit).  If you find local out-of-tree
        *superproject* configs that conflict with the .gitmodules
        values, prefer the superproject configs.  Clobber submodule
        configs and local branches at will (modulo
        submodule.<name>.sync), because any submodule configs that the
        user wanted to keep should have been added to the superproject
        branch earlier (or stashed).
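
The bookkeeping for steps 1, 2.1, and 2.2 can be sketched with
plumbing commands.  The helpers below (require_clean, gitlinks,
plan_switch) are hypothetical names; they only report which submodule
paths would be removed or set up, and assume paths without unusual
characters that 'git ls-tree' would quote:

```shell
# Step 1: fail if tracked content (including submodules) is dirty.
require_clean() {
    test -z "$(git status --porcelain --untracked-files=no)" || {
        echo "error: dirty tree; commit or stash first" >&2
        return 1
    }
}

# List the gitlinked paths (mode 160000) recorded in revision $1, sorted.
gitlinks() {
    git ls-tree -r "$1" |
    awk -F '\t' '$1 ~ /^160000 / { print $2 }' |
    sort
}

# Print "remove <path>" for gitlinks only in the old revision $1 (step
# 2.1) and "setup <path>" for gitlinks only in the new revision $2
# (step 2.2).
plan_switch() (
    old=$(mktemp) && new=$(mktemp) || exit 1
    gitlinks "$1" >"$old"
    gitlinks "$2" >"$new"
    comm -23 "$old" "$new" | sed 's/^/remove /'
    comm -13 "$old" "$new" | sed 's/^/setup /'
    rm -f "$old" "$new"
)
```

A real implementation would act on each reported path instead of
printing it, honoring submodule.<name>.sync while doing so.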

Integrating other branches
--------------------------

Merges and rebases can alter the submodule's in-tree configs (and
create and remove submodules).  The existing logic for merging
.gitmodules and gitlinks works well, so stick with that.  In the event
that there are unresolvable conflicts, bail out and let the user
resolve the conflicts and use 'git commit' to finish checking out the
resolved state.

Issues
------

I like the current submodule integration configuration:

* submodule.<name>.branch (specify the remote branch to integrate, but
  I'd prefer submodule.<name>.integration-ref for clarity).
* submodule.<name>.update (specify how to integrate it, but I'd prefer
  submodule.<name>.integration-mode for clarity).

more than the current core integration configuration:

* branch.<name>.merge (with branch.<name>.remote, the remote branch to
  integrate via merging).
* branch.<name>.rebase (override branch.<name>.merge to integrate via
  rebasing).

These seem to mix the orthogonal concepts of integration target and
integration mode, and the divergence from the .gitmodules
representation makes syncing awkward.

Summary
-------

New .gitmodules options:

* submodule.<name>.local-branch, store the submodule's HEAD, must stay
  in sync for checkouts.

New .git/config options:

* submodule.<name>.active, for init/deinit.

* submodule.<name>.sync, for whether you want to automatically sync
  the submodule's out-of-tree configs up to .gitmodules before
  checkout operations, and sync back from .gitmodules (possibly
  altered on the new branch) into the submodule's out-of-tree configs
  during checkout.

With this tighter binding, submodule information is either tracked in
the superproject, or explicitly not touched by the superproject.  That
makes it much harder to break things or clobber a user's work, and
also much easier to keep submodules up to date with superproject
changes.  Users shouldn't have to explicitly manage their submodules
to carry out routine core tasks like checking out other branches.

I see no reason to add --recurse-submodules flags to 'git checkout'
(and merge, …).  Anything that happens post-clone should recurse
through submodules automatically, and use the submodule.<name>.active
setting to decide when recursion is desired.
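
As an illustration of that decision, recursion could start from a list
like the one produced below.  This is a sketch only: it assumes the
submodule.<name>.active setting proposed above lives in the
superproject's .git/config, that submodule names contain no whitespace,
and the function name active_submodule_paths is made up:

```shell
# Print the path of every submodule listed in .gitmodules whose
# submodule.<name>.active is true in the superproject's local config.
# Run from the superproject root.
active_submodule_paths() {
    git config -f .gitmodules --get-regexp '^submodule\..*\.path$' |
    while read -r key path
    do
        # key looks like "submodule.<name>.path"; strip both ends.
        name=${key#submodule.}
        name=${name%.path}
        if test "$(git config --bool --get "submodule.$name.active")" = true
        then
            printf '%s\n' "$path"
        fi
    done
}
```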

I think the ideal submodule-specific interface would be just:

* git submodule [--quiet] add [-b <branch>] [-f|--force] [--name <name>]
                [--reference <repository>] [--] <repository> [<path>]
* git submodule [--quiet] init [--] [<path>...]
* git submodule [--quiet] deinit [-f|--force] [--] <path>...
* git submodule [--quiet] foreach [--recursive] <command>

The current 'git submodule update --remote' would just be:

  $ git submodule foreach 'git pull'

because all of the local-branch checkouts would have already been
handled.  Similarly, a global push would be just:

  $ git submodule foreach 'git push'

You get all the per-submodule configuration (for triangular workflows,
etc.) for free, with no submodule-specific confusion.

So, is this:

* Interesting enough to be worth pursuing?
* Simple enough to be easily understood?

I'd be happy to mock this up in shell, but only if anyone else would
be interested enough to review the implementation ;).  Then I'll look
into integrating the preferred model (this tightly bound proposal, or
v3's looser bindings, or <your idea here>) in C, building on Jens and
Jonathan's work.

Cheers,
Trevor

[1]: http://article.gmane.org/gmane.comp.version-control.git/44162
[2]: http://article.gmane.org/gmane.comp.version-control.git/211042

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy