Optimizing for partial clone with '--filter=tree:0'

Alexandr Miloslavskiy <alexandr.miloslavskiy@xxxxxxxxxxx> · Mon, 5 Oct 2020 18:38:17 +0200

We are implementing a git UI. One interesting case is a repository
cloned with '--filter=tree:0', because it makes it a lot harder to
run basic git operations such as file log and blame.

Along the way we ran into a number of problems. We should be able to
make patches, at least for (2) and (4), if that is deemed worthwhile
and the plan is clear enough. Note that optimal patches (as we see it)
will involve a protocol change.

(1) Is it even considered a realistic use case?
-----------------------------------------------
I used the Linux repository as an example of a reasonably large repo:
  https://github.com/torvalds/linux.git (951025 commits)

I cloned the Linux repository with various filters and got these stats:
  git clone --bare <url>
	7'624'042 objects
	   2.86gb network
	   3.10gb disk
  git clone --bare --filter=blob:none <url>
	5'484'714 (71.9%) objects
	   1.01gb (35.3%) network
	   1.16gb (37.4%) disk
  git clone --bare --filter=tree:0 <url>
	  951'693 (12.5%) objects
	   0.47gb (16.4%) network
	   0.50gb (16.1%) disk
  git clone --bare --depth 1 --branch master <url>
	   74'380 ( 0.9%) objects
	   0.19gb ( 6.6%) network
	   0.19gb ( 6.1%) disk

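My conclusion is that '--filter=tree:0' can be desirable because it
saves a substantial amount of disk space and network traffic: about
16% of a full clone, and less than half of a '--filter=blob:none' clone.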
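
The percentages above are relative to the full bare clone. As a quick
sanity check, they can be recomputed from the raw numbers with a small
helper; 'pct' is just a name chosen here for illustration, not an
existing tool:

```shell
# Print "part" as a percentage of "whole", rounded to one decimal place.
pct() {
    LC_ALL=C awk -v part="$1" -v whole="$2" \
        'BEGIN { printf "%.1f%%\n", 100 * part / whole }'
}

# The '--filter=tree:0' row compared to the full bare clone:
pct 951693 7624042   # objects  -> 12.5%
pct 0.47 2.86        # network  -> 16.4%
pct 0.50 3.10        # disk     -> 16.1%
```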

(2) A command to enrich repo with trees
---------------------------------------
Since all filters currently include commit objects, it doesn't seem
possible to add the trees alone to a repository that already has the
commits. It does seem possible to download trees together with commits
like this:

  git -c "remote.origin.partialclonefilter=blob:none" fetch \
      --deepen=999999 origin

  Here, '--deepen' is a dirty hack to convince git to re-download
  commits that are already present locally (though without trees).

  Here, '-c' is a workaround for the problem where 'git fetch'
  overwrites the filter in the config. This problem is probably solved
  by the cooking topic 'fetch: do not override partial clone filter'.

However, according to the figures in (1), re-downloading commits should
cost about as much as 'clone --filter=tree:0', that is 0.5gb extra in
the case of the Linux repo. It would be nice to avoid that by having a
filter like "trees only please".

It would also be nice to get rid of the '--deepen' hack.

(3) Properly supporting 'git blame' and 'git log -- path'
---------------------------------------------------------
Currently, missing objects are downloaded from the promisor remote one
at a time, which is very slow. For example, 'git blame' will download
trees for commits, processing one commit at a time. See (4) for a
possible solution.

(4) Command to download ALL trees for a subpath
-----------------------------------------------
E.g. for the blamed path '/1/2/3/4.txt', only its parent trees would be
downloaded:
  '/1'
  '/1/2'
  '/1/2/3'

Such a minimal approach should fall in line with the user's intention
for using '--filter=tree:0': the user evidently wanted to minimize
something, be it disk or network usage. It would not be nice if the
first 'git blame' turned the repo into one with all trees, as if it had
been cloned with '--filter=blob:none'.
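
To make the intended set of trees concrete, here is a small sketch in
POSIX shell that lists the parent trees of a path, shallowest first.
The name 'parent_trees' is made up for illustration; it is not a git
command and it only manipulates strings, without touching any repo:

```shell
# Print every parent directory of a repository path, shallowest first.
# Assumes the path is written with a leading '/', as in the example above.
parent_trees() {
    path=${1#/}     # drop the leading slash
    prefix=
    # Loop while at least one '/' remains, i.e. while the next
    # component is a directory rather than the file itself.
    while [ "${path#*/}" != "$path" ]; do
        prefix="$prefix/${path%%/*}"
        printf '%s\n' "$prefix"
        path=${path#*/}
    done
}

parent_trees /1/2/3/4.txt
# prints:
#   /1
#   /1/2
#   /1/2/3
```

A file directly under the root prints nothing, because only the root
tree would be needed for it.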

Currently '--filter=sparse:oid' is there to support that, but it is
very hard to use on the client side, because it requires the paths to
be already present in a commit on the server.

As a possible solution, it sounds reasonable to have a filter like this:
  --filter=sparse:pathlist=/1/2
The path list could be delimited with some special character, and the
paths themselves could be escaped.
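
As a rough sketch of what such delimiting and escaping could look
like, the helper below joins paths with ':' and backslash-escapes any
'\' or ':' inside a path. Everything here is an assumption made for
illustration: the 'encode_pathlist' name, the choice of ':' as the
delimiter and the backslash escaping are not part of git, and
'sparse:pathlist' itself is only the proposal above.

```shell
# Hypothetical encoding for the proposed 'sparse:pathlist' filter:
# paths are joined with ':', and '\' and ':' inside a path are
# escaped with a backslash. Paths ending in a newline are not handled.
encode_pathlist() {
    sep=
    for p in "$@"; do
        # Escape backslashes first, then the delimiter.
        esc=$(printf '%s\n' "$p" | sed -e 's/\\/\\\\/g' -e 's/:/\\:/g')
        printf '%s%s' "$sep" "$esc"
        sep=:
    done
    printf '\n'
}

encode_pathlist /1/2 '/docs/a:b.txt'
# prints: /1/2:/docs/a\:b.txt
```

The result would then be passed as the value of the proposed filter,
e.g. '--filter=sparse:pathlist=<encoded list>', once such a filter
exists.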

On top of helping with 'git blame' and 'git log', this feature should
help a lot with sparse clones of large mono-repos, such as Google's
super-mono-repo.