Cloning subfolder as new root; subfolder as worktree

Mickey Endito <mickey.endito.2323@xxxxxxxxxxxxxx> · Thu, 20 Aug 2020 08:03:02 +0000

Dear all,

I'm currently missing a feature in git: the ability to clone a subfolder as the
root or, phrased differently, to use a subfolder as a worktree.

This mail has become rather long, so here is an overview: First, I state the
problem and give some use cases. Then, I list a couple of partial workarounds
and other methods for the problem, which are currently achievable with git.
Lastly, I provide an idea how this feature could be implemented within git's
object model.

The Problem
-----------

Assume you have a git repository A with the following file structure:

foo/a.txt
bar/b.txt
bar/baz/c.txt

What I want to achieve is creating a git repo where bar/ equals the root /,
i.e. a repo B with the contents

b.txt
baz/c.txt

We can describe that as a lens or zoom of the repo A.
I believe svn had that capability but I'm not sure.
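
For illustration: the content wanted for repo B already exists in A's object
store as the tree object HEAD:bar. The following is a minimal sketch, assuming
a throwaway repo created under mktemp; the demo identity and the variable
names are made up for this example.

```shell
set -e
# Hypothetical demo identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Throwaway stand-in for repo A with the layout from above.
A=$(mktemp -d)
git init -q "$A"
cd "$A"
mkdir -p foo bar/baz
echo a > foo/a.txt
echo b > bar/b.txt
echo c > bar/baz/c.txt
git add .
git commit -q -m "initial"

# The tree object behind bar/ lists its paths relative to bar/,
# which is exactly the layout wanted for repo B.
zoomed=$(git ls-tree -r --name-only HEAD:bar)
printf '%s\n' "$zoomed"
```

This prints b.txt and baz/c.txt, so the "zoom" is a question of which tree gets
checked out, not of creating new objects.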

Applications for that feature
-----------------------------

Notation: I use "repo:path" to indicate that path should be seen relative to
the repository repo.

* I have a repo which mimics (parts of) my entire file system (think
  configuration files).  I'd like to be able to check out the subfolder
  repo:/etc/foobar in the actual filesystem folder /etc, without checking out
  the rest of repo:/etc, as that would lead to disaster.

* I have a website project where the html files (which are not generated by
  a build script) are in repo:www/ and I'd like to check them out to /srv/www/
  for deployment.

* Think of a big project with different components possibly stacked deep in
  a directory structure. We want to work on a single component somewhere down
  that structure, e.g. repo:client/new/x11/gtk/daemon/test/testapp. We could
  use git-sparse-checkout for that but that would leave us with a lot of
  quasi-empty directories client/new/x11/gtk/daemon/test/ where in this case
  only the testapp directory is relevant.

Things/Workarounds I am aware of
--------------------------------

1. git submodules

You probably think: use git-submodules. However, bar/ is not a dependency or
library where that would make sense but rather a part of repo A.  So the
logical dependency is reversed: it's not A that depends on B but rather
B depends on A.  In some use cases changes in bar/ require additional changes
in foo/ (in those use cases B is like a read-only view).

2. git clone + filter-branch

We can clone the repo followed by a git-filter-branch (or its alternative
git-filter-repo)

git clone /path/to/repo/A /path/to/repo/B
cd /path/to/repo/B
git filter-branch --subdirectory-filter bar -- --all

This creates a sort of read-only clone, but it has massive drawbacks:
* We cannot do a simple git pull to update repo B to the new state of repo A.
  To do that we have to clone and filter-branch it again.
* It changes commit-IDs.
* We cannot push changes done in B back to A.
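
The commit-ID drawback can be reproduced with a small sketch like the
following. It assumes throwaway repos under mktemp and a made-up demo identity.
FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning and delay that
newer git versions print before filter-branch starts.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
# Hypothetical demo identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
A="$work/A"
B="$work/B"

# Throwaway stand-in for repo A.
git init -q "$A"
(
  cd "$A"
  mkdir -p foo bar/baz
  echo a > foo/a.txt
  echo b > bar/b.txt
  echo c > bar/baz/c.txt
  git add .
  git commit -q -m "initial"
)

# Workaround 2: clone, then rewrite history so that bar/ becomes the root.
git clone -q "$A" "$B"
cd "$B"
git filter-branch --subdirectory-filter bar -- --all >/dev/null 2>&1

# The same logical commit now has two different ids.
id_a=$(git -C "$A" rev-parse HEAD)
id_b=$(git -C "$B" rev-parse HEAD)
echo "A: $id_a"
echo "B: $id_b"
```

The two ids differ because the commit in B points to a different root tree, so
B no longer shares history with A.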

3. git-subtree

We can use git-subtree to filter the subdirectory and then clone the generated
branch as repo B, like so:

# in repo A
cd /path/to/repo/A
git subtree split --prefix=bar --annotate 'bar: ' --branch branch-bar
git clone -b branch-bar /path/to/repo/A /path/to/repo/B

Here we have:
* It requires support from repo A which must generate the branch-bar.
* Repo A now must contain two commit-histories (the main branch and the
  branch-bar) of the same logical-history.  In particular the commit ids are
  different for the same logical commit in the main branch and the branch-bar.
* branch-bar must be regenerated every time. I have not (yet) investigated
  whether git-subtree is capable of continuing a split from the last commit. So
  far I have only managed to make it recreate all commits in branch-bar (but
  with the same commit-ids as before).
* Because the commit-ids of branch-bar do not change (at least when called with
  the same arguments), we can use git pull to update repo B.
* We can push changes in repo B back to repo A in the branch-bar. But
  I currently do not see a simple way to incorporate these changes into
  the main branch.

4. git-sparse-checkout

We can use git-sparse-checkout like so

git clone /path/to/repo/A --no-checkout /path/to/repo/B
cd /path/to/repo/B
git sparse-checkout init
git sparse-checkout set bar

This is kind of close to what I want in the sense that we can push and pull and
the commit-ids are unaltered. However, this totally gets the directory
structure wrong, which is a no-go in some of the above use cases.
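
A small sketch of the resulting layout, assuming git 2.25 or newer (which
ships git-sparse-checkout), throwaway repos under mktemp and a made-up demo
identity. To keep the sketch independent of how the initial checkout is
triggered, it uses a normal clone instead of --no-checkout.

```shell
set -e
# Hypothetical demo identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
A="$work/A"
B="$work/B"

# Throwaway stand-in for repo A.
git init -q "$A"
(
  cd "$A"
  mkdir -p foo bar/baz
  echo a > foo/a.txt
  echo b > bar/b.txt
  echo c > bar/baz/c.txt
  git add .
  git commit -q -m "initial"
)

git clone -q "$A" "$B"
cd "$B"
git sparse-checkout init
git sparse-checkout set bar

# List what is left in the working tree (ignoring .git).
find . -path ./.git -prune -o -type f -print | LC_ALL=C sort
```

The listing shows ./bar/b.txt and ./bar/baz/c.txt: foo/ is gone and the
commit-id is unchanged, but the bar/ prefix is still there.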

An idea for a solution
----------------------

The following is an idea how the above feature could be implemented. This is
just a rough sketch and I have not thought about how this approach would
interact with other git tooling.

We add a (for example; names are up for debate) --subfolder argument to git
clone:

git clone --subfolder bar /path/to/repo/A /path/to/repo/B

This clones the complete repo A but checks out the contents of path bar in the
root directory. The HEAD points to the (full) commit. Additionally,
somewhere(tm) we store that we have zoomed in to only see paths in bar (maybe
git-worktree can be expanded for that?).

That is, stuff under bar is treated like a checked-out repo while all other
stuff is treated like being in a bare repo. (This at least should be the
guideline when thinking about the behaviour git should provide.)

A git push or git pull does the normal update of the repo, but when checking
out files to the working directory only those files under bar/ are considered.
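
The checkout half of this can be approximated today with plumbing commands.
The following is only a sketch of that idea, not the proposed feature: it
assumes a throwaway repo under mktemp, a made-up demo identity, a scratch
index file (zoom.index) and a hypothetical target directory "deploy" standing
in for something like /srv/www/.

```shell
set -e
# Hypothetical demo identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
A="$work/A"
dest="$work/deploy"            # hypothetical stand-in for e.g. /srv/www/
zoom_index="$work/zoom.index"  # scratch index, separate from A's own index

# Throwaway stand-in for repo A.
git init -q "$A"
cd "$A"
mkdir -p foo bar/baz
echo a > foo/a.txt
echo b > bar/b.txt
echo c > bar/baz/c.txt
git add .
git commit -q -m "initial"

mkdir -p "$dest"
# Load only the subtree bar/ into the scratch index ...
GIT_INDEX_FILE="$zoom_index" git read-tree HEAD:bar
# ... and write those entries out with $dest/ as their root.
GIT_INDEX_FILE="$zoom_index" git checkout-index -a --prefix="$dest/"

find "$dest" -type f | LC_ALL=C sort
```

The target directory ends up with b.txt and baz/c.txt at its top level. What
this emulation lacks is everything else the idea asks for: status, commit,
pull and push from within the zoomed directory.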

When editing and committing files in repo B, the following would be a sane thing
to do: The (old) tree of the current HEAD is taken and then the subtree
corresponding to bar is replaced with the tree in the index. That way we
generate a full valid commit which can be pushed back to repo A.
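
The tree replacement described here can be emulated with existing plumbing.
The sketch below is an assumption about how such a commit could be assembled,
not a statement about how git would implement it. The scratch index files, the
demo identity and the edited file content are made up for the example.

```shell
set -e
# Hypothetical demo identity so that commits can be created anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
A="$work/A"
zoom_index="$work/zoom.index"  # plays the role of the zoomed repo's index
full_index="$work/full.index"  # scratch index used to assemble the full tree

# Throwaway stand-in for repo A.
git init -q "$A"
cd "$A"
mkdir -p foo bar/baz
echo a > foo/a.txt
echo b > bar/b.txt
echo c > bar/baz/c.txt
git add .
git commit -q -m "initial"
old_commit=$(git rev-parse HEAD)

# 1. The zoomed index holds bar/ as its root; stage an edit to b.txt there.
GIT_INDEX_FILE="$zoom_index" git read-tree HEAD:bar
new_blob=$(echo "b edited" | git hash-object -w --stdin)
GIT_INDEX_FILE="$zoom_index" git update-index --add \
    --cacheinfo 100644,"$new_blob",b.txt
new_bar=$(GIT_INDEX_FILE="$zoom_index" git write-tree)

# 2. Take the (old) full tree of HEAD and swap in the new subtree for bar/.
GIT_INDEX_FILE="$full_index" git read-tree "$old_commit"
GIT_INDEX_FILE="$full_index" git rm -r -q --cached bar
GIT_INDEX_FILE="$full_index" git read-tree --prefix=bar/ "$new_bar"
new_tree=$(GIT_INDEX_FILE="$full_index" git write-tree)

# 3. Create a full commit on top of the old HEAD.
new_commit=$(git commit-tree "$new_tree" -p "$old_commit" -m "bar: edit b.txt")
git ls-tree -r --name-only "$new_commit"
```

The new commit still contains foo/a.txt with its old content; only the subtree
under bar/ differs from its parent, so it could be pushed to A like any other
commit.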

If we switch/checkout to a branch/commit that has no bar/ directory, then the
checked-out copy should be empty. If we add something and commit it, then the
parent tree-objects of the new commit should be altered to contain the path bar.
As git does not track directories this should work out as expected.

Merge conflicts happen. If these happen for files inside the bar directory,
then we can do our usual stuff. Due to the flexibility of git, it can also
happen that the commits/trees to be merged have a conflict outside of the bar
directory. In that case we cannot produce a working copy of the commit. Thus,
it seems appropriate to abort whatever we do and inform the user to use a full
clone for doing the merge.

When cloning repo B to repo C, there are no restrictions whatsoever as B has
a full copy of the repository (just not checked out). So when looking from
"outside" repo A and repo B are indistinguishable. Thus the following works:

git clone --subfolder bar /path/to/repo/A /path/to/repo/B
git clone /path/to/repo/B /path/to/repo/C

Repo A and C are the same (both without a zoom).

git clone --subfolder bar /path/to/repo/A /path/to/repo/B
git clone --subfolder foo /path/to/repo/B /path/to/repo/C1

git clone --subfolder foo /path/to/repo/A /path/to/repo/C2

Repo C1 and C2 are the same (both zoomed to foo/).

Non-goals:

The following (weird?) thing is outside of the scope of this idea:
zooming into two (or more) directories simultaneously, e.g.

repo A:
foo1/bar/...
foo2/foo3/baz/...

and with the hypothetical git clone --subfolder foo1 --subfolder foo2/foo3

we get

repo B:
bar/...
baz/...

Also, converting a zoomed repo B into a full (non-bare) repo A is not part of
it, although I think this could easily be achieved by deleting the reference to
the subfolder and doing a `git reset --hard` on the working copy.

Summary
-------

The problem is not about the size of the checkout (which sparse-checkout
tackles), nor the size of the repo or the amount of data which needs to be
downloaded (both of which clone --depth tackles); it's about getting the
directory structure in repo B right while also keeping a strong link to repo
A as upstream to pull (and maybe push) changes.

If I have missed any approach for a solution I'd like to hear about it.

Best
Mickey