Re: Questions about partial clone with '--filter=tree:0'

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 20 Oct 2020 18:29:34 -0400

Hi Alexandr,

On Tue, Oct 20, 2020 at 07:09:36PM +0200, Alexandr Miloslavskiy wrote:
> This is an edited copy of a message I sent 2 weeks ago, which unfortunately
> didn't receive any replies. I tried to make it shorter this time :)

Oops. That can happen sometimes, but thanks for re-sending. I'll try to
answer the basic points below.

> ----
>
> We are implementing a git UI. One interesting case is the repository
> cloned with '--filter=tree:0', because it makes it a lot harder to
> run basic git operations such as file log and blame.
>
> The problems and potential solutions are outlined below. We should be
> able to make patches for (2) and (3) if it makes sense to patch these.
>
> (1) Is it even considered a realistic use case?
> -----------------------------------------------
> Summary: is '--filter=tree:0' a realistic or "crazy" scenario that is
> not considered worthy of supporting?

It's not an unrealistic scenario, but it might be for what you're trying
to build. If your UI needs to run, say, 'git log --patch' to show a
historical revision, then you're going to need to fault in a lot of
missing objects.

If that's not something that you need to do often or ever, then having
'--filter=tree:0' is a good way to get the least amount of data possible
when using a partial clone. But if you're going to be performing
operations that need those missing objects, you're probably better off
eating the network/storage cost all at once, rather than making the user
wait for Git to fault in the set of missing objects that it happens to
need.
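
To get a feel for how much would have to be faulted in, you can ask Git
which objects are absent locally without triggering any fetches. This is
only a rough sketch: it builds a throwaway local repository to stand in
for the remote (the paths, file contents, and the "demo" identity are
made up for illustration), clones it with '--filter=tree:0', and counts
the objects reachable from HEAD that are not present.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Throwaway "server" repository with two commits (hypothetical content).
git init -q "$work/server"
cd "$work/server"
git config user.name demo
git config user.email demo@example.com
mkdir dir
echo one >dir/file
git add dir/file
git commit -q -m first
echo two >dir/file
git commit -q -a -m second
# The serving side has to opt in to object filters.
git config uploadpack.allowFilter true

# Treeless clone; --no-checkout keeps Git from faulting anything in.
git clone -q --no-checkout --filter=tree:0 "file://$work/server" "$work/client"
cd "$work/client"

# --missing=print reports absent objects with a leading '?' instead of
# fetching them. Traversal stops at a missing tree, so on a treeless
# clone this is a lower bound on what a full walk would need.
missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "objects missing locally: $missing"
```

Running the same 'git rev-list' line inside a real clone gives a rough
idea of whether the operations your UI performs will be dominated by
on-demand fetches.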

> (2) A command to enrich repo with trees
> ---------------------------------------
> There is no good way to "un-partial" repository that was cloned with
> '--filter=tree:0' to have all trees, but no blobs.

There is no command to do that directly, but it is something that Git is
capable of.

It would look something like:

  $ git config remote.origin.partialclonefilter 'blob:none'

Now your repository is in a state where it has no blobs or trees, but
the filter does not prohibit it from getting the trees, so you can ask
it to grab everything you're missing with:

  $ git fetch origin

This should even be a pretty fast operation for repositories that have
bitmaps due to some topics that Peff and I sent to the list a while ago.
If it isn't, please let me know.
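
If you want to check whether the fetch really filled in the trees on
your version of Git, something along these lines should do. It is a
sketch under assumptions: the local "server" repository, its contents,
and the helper name count_present are invented for the example, and it
makes no claim about what the numbers will be.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Throwaway "server" repository (hypothetical content).
git init -q "$work/server"
cd "$work/server"
git config user.name demo
git config user.email demo@example.com
mkdir dir
echo one >dir/file
git add dir/file
git commit -q -m first
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

git clone -q --no-checkout --filter=tree:0 "file://$work/server" "$work/client"
cd "$work/client"

# Hypothetical helper: count objects reachable from HEAD that are present
# locally. --missing=print lists absent ones with a leading '?', so every
# other line is a commit, tree, or blob we already have.
count_present() {
	git rev-list --objects --missing=print HEAD | grep -vc '^?' || true
}

before=$(count_present)
git config remote.origin.partialclonefilter 'blob:none'
git fetch -q origin
after=$(count_present)
filter=$(git config remote.origin.partialclonefilter)

# If "after" is larger than "before", trees have arrived.
echo "filter=$filter present before=$before after=$after"
```

On a treeless clone the "before" number is just the commits, so any
growth after the fetch means trees were downloaded.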

> There seems to be a dirty way of doing that by abusing 'fetch --deepen'
> which happens to skip "ref tip already present locally" check, but
> it will also re-download all commits, which means extra ~0.5gb network
> in case of Linux repo.

Mmm, this is probably not what you're looking for. You may be confusing
shallow clones (to which --deepen is relevant) with partial clones
(to which --deepen is irrelevant).

> (3) A command to download ALL trees and/or blobs for a subpath
> --------------------------------------------------------------
> Summary: Running a Blame or file log in '--filter=tree:0' repo is
> currently very inefficient, up to a point where it can be discussed
> as not really working.

This may be a "don't hold it that way" kind of response, but I don't
think that this is quite what you want. Recall that cloning a
repository with an object filter happens in two steps: first, an initial
download of all of the objects that Git thinks you need, and then
(second) a follow-up fetch requesting the objects that you need to
populate your checkout.

I think what you probably want is a step 1.5 to tell Git "I'm not going
to ask for or care about the entirety of my working copy, I really just
want objects in path...", and you can do that with sparse checkouts. See
https://git-scm.com/docs/git-sparse-checkout for more.

The flow might be something like:

  $ git clone --sparse --filter=tree:0 git@xxxxxxxxxxxx:repo.git

and then:

  $ cd repo
  $ git sparse-checkout add foo bar baz
  $ git checkout .
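
To try that flow end to end without touching a real remote, here is a
small sketch. The local "server" repository and its layout (foo/, bar/,
README) are assumptions made up for the example; only the clone,
sparse-checkout, and checkout commands come from the steps above.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Throwaway "server" repository with two directories (hypothetical layout).
git init -q "$work/server"
cd "$work/server"
git config user.name demo
git config user.email demo@example.com
mkdir foo bar
echo a >foo/a.txt
echo b >bar/b.txt
echo top >README
git add .
git commit -q -m initial
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true

# Sparse, treeless clone: only top-level files are checked out at first.
git clone -q --sparse --filter=tree:0 "file://$work/server" "$work/client"
cd "$work/client"

# Widen the sparse checkout to foo/ only; bar/ stays out of the worktree.
git sparse-checkout add foo
git checkout -q .

# Show the active patterns and what actually landed on disk.
git sparse-checkout list
ls
```

The objects for bar/ are never requested for the working copy, which is
the point of combining the filter with a sparse checkout.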

Thanks,
Taylor