submodules: multiple alternative relative URL schemes?

Matthew Ogilvie <mmogilvi+git@xxxxxxxx> · Sun, 4 Feb 2024 15:14:26 -0700

Background:

At work we are switching to submodules to share various
utility code between mostly-unrelated projects and programs.

When working on any one project, I commonly fetch and push
preliminary changes between numerous different sandboxes in
various places (same machine, different machines, VMs, etc).
Basically, I make extensive use of git's distributed nature.

Issue:

However, submodules don't seem to streamline fetching/pushing
the whole collection of modules together from varied sources
very well, without manually configuring or manipulating each
submodule and its "remote"s individually.  For example, if
I recursively clone from a local sandbox with local changes
into another local sandbox, it will clone the main superproject
from the local source, but the submodules will be downloaded
from the URLs in the .gitmodules file, not the local copies
(that might have local changes that will be missed).

-----

Relative URLs could maybe help, except there seem to be at least
two common relative URL repository schemes/organizations that
are often needed, and probably more:

1. Subproject bare repositories are immediately adjacent to the
superproject repository, such that relative URLs like
"../submodule1.git" should work.

This is probably common in web-based hosting of personal "forks"
like in github, bitbucket, or similar.

It is also the most obvious way to organize multiple
semi-related repositories on a small self-hosted server.

2. Or you want to fetch changes between sandboxes, perhaps
to test code changes on multiple platforms, or to continue someone
else's incomplete work.  In this case, relative links like
"./.git/modules/submodule1" would probably work.  But bare/origin/main
repositories would rarely be organized like this.

3. "Official" configured upstream URLs may also be organized in other
ways.  For example, in something like github, different upstream URLs
may be "owned" by different owners, so they may have URLs like
"BASE/OWNER1/MODULE1.git", "BASE/OWNER2/MODULE2.git",
with different and semi-random "OWNER"s, not amenable to template
strings (see (E) below).  Or different submodule upstreams may be on
completely different servers...
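
To make schemes 1 and 2 concrete, here is a rough Python sketch of
how a relative submodule URL could be resolved against the URL the
superproject was cloned from.  It is a simplification for
illustration, not git's actual resolution code: the function name is
made up, and it only handles plain paths and "scheme://host/path"
URLs (no scp-style "host:path" remotes).

```python
import posixpath


def resolve_submodule_url(superproject_url: str, relative_url: str) -> str:
    """Resolve a relative submodule URL against the superproject URL.

    Hypothetical helper; a simplified model of relative URL handling.
    """
    if not relative_url.startswith(("./", "../")):
        return relative_url  # not relative: use it unchanged

    # Keep "scheme://host" aside so normalization only touches the path.
    scheme, sep, rest = superproject_url.partition("://")
    if sep:
        host, slash, path = rest.partition("/")
        base = scheme + sep + host
        path = slash + path
    else:
        base, path = "", superproject_url

    return base + posixpath.normpath(posixpath.join(path, relative_url))


# Scheme 1: bare repositories adjacent to the superproject.
print(resolve_submodule_url("https://example.com/me/super.git",
                            "../submodule1.git"))
# Scheme 2: cloning from another sandbox on the same machine.
print(resolve_submodule_url("/home/me/sandbox",
                            "./.git/modules/submodule1"))
```

The point is that a single relative URL in .gitmodules can only
match one of these layouts at a time, which is why one scheme alone
does not seem to be enough.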

-----

Basically, I would like to restore or at least improve git's
ability to transparently/smoothly work in a "distributed" manner
(no "blessed main upstream repositories") when a project
requires certain submodules.

Some Ideas:

A. Is there something I'm missing?  Is there some good solution
I just haven't stumbled over?  Any documentation URLs and/or
keywords I should look for?

B. Of course no matter what, you could always set up all your various
remote URLs individually manually, or write special project-specific
helper scripts.  But something more automated and built-in
would be nice...

C. future?: Potentially, the .gitmodules file could define multiple
different URLs for each submodule, distinguished by named URL
"schemes" that could be specified on the command line.
"git clone", "git fetch", "git push", "git submodule", etc could
all take an optional "scheme" argument of some kind to indicate
which sets of URLs to use...

D. future?: Potentially hard-code some schemes?  1 and 2 are probably the
most common alternative relative URL schemes, so it might be worth
explicitly supporting them without requiring any explicit configuration
changes.  Maybe this would be adequate by itself; don't even support
other "alternative" named schemes (multiple 3's)?

E. future?: In some cases, a "scheme" for a whole set of URLs
could maybe be simplified into a single URL template string.
Substitute in the "base" and the submodule name to get a
specific URL...

F. Potentially add some optional fallback logic?  If it fails to
fetch/clone from one URL, try another scheme instead?  (But this
might leave things in a confusingly unpredictable/mixed state,
particularly if the network is flaky...)

G. Different approach: Instead of separate repositories, maybe
the submodules could essentially all be in one repository, but
distinguished by using different tags/namespaces in the "refs"
hierarchy.  ("refs/heads/SUBMODULE/WHATEVER"?
"refs/SUBMODULE/heads/WHATEVER"?)  The "original/main" upstream
might still split up repositories, but clones/local changes/etc
could be more self-contained this way...  (Working out all the
nuances to streamline this idea might be a whole separate
discussion...)

H. "git subtree" instead of submodules: If it were up to me, I would
prefer the subtree solution.  Most of the time you can ignore the fact that
some code is shared, and only need to worry about it when actually
syncing/merging shared code between different projects, which is
generally much rarer than day-to-day development tasks.
But unfortunately subtrees seem to be somewhat unpopular, as
currently implemented: Isolated in "contrib", doing graph traversal
logic in shell script of all things (very slow, especially on
Windows where many of my coworkers reside), etc.  And ultimately
in my case I think I've already lost this argument.  So I'm trying
to figure out how to make submodules work better.

I. Any other ideas?
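
To illustrate how (C), (E), and (F) might fit together, here is a
small Python sketch of named schemes expressed as URL template
strings.  Everything in it is hypothetical: the scheme names
("adjacent", "sandbox"), the placeholder names ({base}, {parent},
{name}), and both functions are invented for this example, and
nothing like this exists in git today.

```python
import posixpath

# Hypothetical named schemes, each a single URL template string.
SCHEMES = {
    "adjacent": "{parent}/{name}.git",        # scheme 1 above
    "sandbox": "{base}/.git/modules/{name}",  # scheme 2 above
}


def submodule_url(scheme: str, base: str, name: str) -> str:
    """Expand one scheme's template for a given superproject base URL."""
    base = base.rstrip("/")
    return SCHEMES[scheme].format(
        base=base,
        parent=posixpath.dirname(base),  # everything before the last "/"
        name=name,
    )


def candidate_urls(order: list[str], base: str, name: str) -> list[str]:
    """Idea (F): URLs to try in turn, in the caller's preferred order."""
    return [submodule_url(scheme, base, name) for scheme in order]


print(submodule_url("adjacent", "https://example.com/me/super.git",
                    "submodule1"))
print(candidate_urls(["sandbox", "adjacent"], "/home/me/sandbox",
                     "submodule1"))
```

Case 3 (differing owners or servers) would still need explicit
per-submodule URLs, since no single template can cover it.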

-----

In the short term, I expect I'll have to resort to (B) (custom scripts).
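
A minimal sketch of what such a script might look like, in Python.
It only builds the "git remote add" commands rather than running
them, so the result can be inspected first.  The function names and
the remote name "peer" are made up for illustration, and it assumes
the other sandbox keeps its submodule repositories under
.git/modules/NAME, which may not hold for old or unusual checkouts.

```python
def parse_submodule_paths(config_output: str) -> dict[str, str]:
    """Map submodule name -> path.

    Expects the output of:
      git config --file .gitmodules --get-regexp path
    i.e. lines of the form "submodule.NAME.path PATH".
    """
    paths = {}
    for line in config_output.splitlines():
        if not line.strip():
            continue
        key, path = line.split(" ", 1)
        # Strip the "submodule." prefix and ".path" suffix to get NAME.
        name = key[len("submodule."):-len(".path")]
        paths[name] = path
    return paths


def peer_remote_commands(paths: dict[str, str], peer_sandbox: str,
                         remote: str = "peer") -> list[list[str]]:
    """Build one `git remote add` command per submodule.

    Each submodule gets a remote pointing at the matching repository
    inside the other sandbox (relative URL scheme 2).
    """
    peer = peer_sandbox.rstrip("/")
    return [
        ["git", "-C", path, "remote", "add", remote,
         f"{peer}/.git/modules/{name}"]
        for name, path in paths.items()
    ]


sample = "submodule.submodule1.path libs/submodule1\n"
for cmd in peer_remote_commands(parse_submodule_paths(sample),
                                "/home/me/sandbox2"):
    print(" ".join(cmd))
```

Each resulting command could be passed to subprocess.run() from the
top of the superproject, after which "git fetch peer" inside each
submodule would pick up the other sandbox's local changes.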

In the longer term, depending on what people think and if I can find
the time, maybe I could actually implement some of these ideas
in git for a future release?  What do people think of these ideas?

                - Matthew Ogilvie