Re: [RFC PATCH] fetch-pack: lazy fetch using tree:0

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 19 Mar 2020 15:58:33 -0400

On 3/19/2020 1:44 PM, Jonathan Tan wrote:
> Support for partial clones with filtered trees was added in bc5975d24f
> ("list-objects-filter: implement filter tree:0", 2018-10-07), but
> whenever a lazy fetch of a tree is done, besides the tree itself, some
> other objects that it references are also fetched.
> 
> The "blob:none" filter was added to lazy fetches in 4c7f9567ea
> ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to
> restrict blobs from being fetched, but it didn't restrict trees.
> ("tree:0", which would restrict all trees as well, wasn't added then
> because "tree:0" was itself new and may not have been supported by Git
> servers, as you can see from the dates of the commits.)
> 
> Now that "tree:0" has been supported in Git for a while, teach lazy
> fetches to use "tree:0" instead of "blob:none".
> 
> (An alternative to doing this is to teach Git a new filter that only
> returns exactly the objects requested, no more - but "tree:0" already
> does that for us for now, hence this patch. If we were to support
> filtering of commits in partial clones later, I think that specifying a
> depth will work to restrict the commits returned, so we won't need an
> additional filter anyway.)
> ---
> This looks like a good change to me - in particular, it makes Git align
> with the (in my opinion, reasonable) mental model that when we lazily
> fetch something, we only fetch that thing. Some issues that I can think
> about:
> 
>  - Some hosts like GitHub support some partial clone filters, but not
>    "tree:0".
>  - I haven't figured out the performance implications yet. If we want a
>    tree, I think that we typically will want some of its subtrees, but
>    not all.
> 
> Any thoughts?

Fetching missing objects one at a time matches how the GVFS protocol
has handled these tree misses in the past. While there may be many more
round trips, it avoids transferring excess data, since a missing tree
can likely reach several trees and blobs that we already have.
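
To make that trade-off concrete, here is a toy model of it. This is
not Git code: Tree, reachable() and simulate() are hypothetical names,
the object graph is made up, and it assumes a lazy fetch under
"blob:none" returns every tree reachable from the requested one while
"tree:0" returns only the requested tree. Treat it as a sketch of the
counting argument, not as a measurement.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Hypothetical stand-in for a Git tree object (blobs are ignored)."""
    name: str
    subtrees: list = field(default_factory=list)


def reachable(tree):
    """Yield the tree itself and every tree below it."""
    yield tree
    for sub in tree.subtrees:
        yield from reachable(sub)


def simulate(root, known, filter_spec):
    """Walk all trees under root, lazily fetching each missing one.

    Returns (round_trips, objects_sent, redundant_objects), where a
    redundant object is one the client already had before it was sent.
    """
    have = set(known)
    round_trips = sent = redundant = 0
    stack = [root]
    while stack:
        tree = stack.pop()
        if tree.name not in have:
            round_trips += 1
            if filter_spec == "tree:0":
                batch = [tree]                   # only the requested tree
            elif filter_spec == "blob:none":
                batch = list(reachable(tree))    # all trees below it too
            else:
                raise ValueError(f"unsupported filter: {filter_spec}")
            for obj in batch:
                sent += 1
                if obj.name in have:
                    redundant += 1
                have.add(obj.name)
        stack.extend(tree.subtrees)
    return round_trips, sent, redundant


# Made-up example: the root and "b" changed upstream; the client still
# has the unchanged trees "a", "a1", "a2" and "b1".
root = Tree("root", [
    Tree("a", [Tree("a1"), Tree("a2")]),
    Tree("b", [Tree("b1")]),
])
known = {"a", "a1", "a2", "b1"}

for spec in ("tree:0", "blob:none"):
    print(spec, simulate(root, known, spec))
```

In this example "tree:0" needs two round trips and sends two trees,
none of them redundant, while "blob:none" needs one round trip and
sends six trees, four of which the client already had. Which side wins
on a real repository depends on how deep the missing boundary is.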

The real unknown here is how the "boundary" of missing trees is
created. In the GVFS protocol, missing trees happen mostly when our
pre-computed "prefetch pack-files" of commits and trees are behind the
ref tips.

The usage patterns for depth-limited or path-scoped filters are not
quite as established as the blob-limited patterns, which are better
understood because they resemble the behavior in VFS for Git and Scalar.

The code seems to be doing what you say, but I highly recommend taking
this for a spin on a real repository with a real remote, if possible.
The more numbers we can get showing which situations do better with one
filter or the other, the more confidently this change can be adopted.

Thanks,
-Stolee