Re: [RFC PATCH] fetch-pack: lazy fetch using tree:0

Jeff King <peff@xxxxxxxx> · Fri, 20 Mar 2020 02:12:14 -0400

On Thu, Mar 19, 2020 at 10:44:39AM -0700, Jonathan Tan wrote:

> Support for partial clones with filtered trees was added in bc5975d24f
> ("list-objects-filter: implement filter tree:0", 2018-10-07), but
> whenever a lazy fetch of a tree is done, besides the tree itself, some
> other objects that it references are also fetched.
> 
> The "blob:none" filter was added to lazy fetches in 4c7f9567ea
> ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to
> restrict blobs from being fetched, but it didn't restrict trees.
> ("tree:0", which would restrict all trees as well, wasn't added then
> because "tree:0" was itself new and may not have been supported by Git
> servers, as you can see from the dates of the commits.)
> 
> Now that "tree:0" has been supported in Git for a while, teach lazy
> fetches to use "tree:0" instead of "blob:none".

This does mean that a new client lazily fetching objects for a partial
clone from an older server (pre-bc5975d24f) used to work, but now won't.
We couldn't have fetched from such a server with a tree filter in the
first place, but this patch makes the use of tree:0 unconditional, so
even a follow-up fetch in a blob:none clone would use it. I'm not _too_
broken up about that, given that partial clone support in that era was
pretty clearly labeled as experimental. But it would be a nice bonus to
make it work everywhere.

> (An alternative to doing this is to teach Git a new filter that only
> returns exactly the objects requested, no more - but "tree:0" already
> does that for us for now, hence this patch. If we were to support
> filtering of commits in partial clones later, I think that specifying a
> depth will work to restrict the commits returned, so we won't need an
> additional filter anyway.)

The depth thing might work for commits, though there are a lot of
special code paths taken when the client is asking for shallow commits
that might be better avoided.

Being able to say "only send me the objects I'm asking for" seems like a
much more direct way. It doesn't even need to be a filter, really; it
could be a protocol capability. And in fact I think we'd want a
capability either way, because clients would ideally be able to send the
old "blob:none" for older servers, or the new "only what I'm asking for"
with new servers.
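
As a rough sketch of that fallback logic (Python rather than fetch-pack
code; the capability name "exact-objects" and the helper name are made
up for illustration and do not exist in Git):

```python
def pick_lazy_fetch_mode(server_caps):
    """Decide how to phrase a lazy fetch, given the server's capabilities.

    Returns (filter_spec, exact_only). "exact-objects" is a hypothetical
    capability meaning "send only the objects I ask for".
    """
    if "exact-objects" in server_caps:
        # New server: no filter needed, just ask for exactly these objects.
        return None, True
    if "filter" in server_caps:
        # Older server: keep the existing behavior of excluding blobs.
        return "blob:none", False
    # Server without filter support: plain fetch, no restriction possible.
    return None, False


if __name__ == "__main__":
    print(pick_lazy_fetch_mode({"filter"}))
    print(pick_lazy_fetch_mode({"filter", "exact-objects"}))
```

The point is only that the client never sends something the server did
not advertise, so older servers keep working.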

> ---
> This looks like a good change to me - in particular, it makes Git align
> with the (in my opinion, reasonable) mental model that when we lazily
> fetch something, we only fetch that thing. Some issues that I can think
> about:

Yeah, I like the mental model. I just think it should be expressed even
more explicitly. :)

>  - Some hosts like GitHub support some partial clone filters, but not
>    "tree:0".

Yes, this is going to fail against GitHub servers, just like it would
for older servers. One way to prevent that would be to use a "blob"
filter if that's what we originally partial-cloned with. I don't know if
that information always reliably makes it into this code path, though.
I think I'd prefer a capability-based fix in the long run.

We may support "tree:0" eventually at GitHub. It's quick to compute with
bitmaps, just like "blob:none" is. But "tree:1" isn't.

One side note (for Taylor, cc'd): our patches elsewhere to limit the
allowed filters don't make it possible to express the difference between
"tree:0" and "tree:1". It may be worth thinking about that, especially
if it influences the config schema (since we'll have to support it
forever once it makes it into a release).

>  - I haven't figured out the performance implications yet. If we want a
>    tree, I think that we typically will want some of its subtrees, but
>    not all.

I could imagine a scenario where you want to get trees one level at a
time in order to only grab the sub-trees you want based on pathnames
(sort of like sparse-checkout's cone mode). Though you do get into "n+1"
fetches based on tree depth there. If the latency for a fetch is high,
it will be pretty painful.
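
To illustrate the round-trip count, here is a toy model in Python (the
object ids, the in-memory "server", and the function names are all
invented for the example; a real lazy fetch goes through fetch-pack):

```python
# Toy object store: tree id -> {entry name: (kind, object id)}.
SERVER_TREES = {
    "root": {"src": ("tree", "t_src"), "docs": ("tree", "t_docs"),
             "README": ("blob", "b1")},
    "t_src": {"lib": ("tree", "t_lib"), "main.c": ("blob", "b2")},
    "t_lib": {"util.c": ("blob", "b3")},
    "t_docs": {"guide.txt": ("blob", "b4")},
}


def fetch_trees(ids):
    """Stand-in for one lazy fetch: returns exactly the requested trees."""
    return {i: SERVER_TREES[i] for i in ids}


def walk_by_level(root_id, wanted_path):
    """Fetch only the trees along wanted_path, one level per round trip."""
    have = {}
    round_trips = 0
    frontier = [(root_id, ())]
    while frontier:
        have.update(fetch_trees([tid for tid, _ in frontier]))
        round_trips += 1
        next_frontier = []
        for tid, path in frontier:
            for name, (kind, oid) in have[tid].items():
                child = path + (name,)
                # Descend only into subtrees that lie on the wanted path.
                if kind == "tree" and wanted_path[:len(child)] == child:
                    next_frontier.append((oid, child))
        frontier = next_frontier
    return have, round_trips


if __name__ == "__main__":
    have, trips = walk_by_level("root", ("src", "lib"))
    print(sorted(have), trips)
```

A path that is n directories deep costs n+1 fetches here, and the
unrelated "docs" tree is never requested.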

I can equally imagine there are cases where you want to grab the whole
subtree in one go, but I think that raises another performance issue:
you might already have most of it. E.g., consider a root tree with one
toplevel subtree that contains a million files. You already have the
root tree at some commit A. Now you want to diff against its parent, B.
You ask the server for B^{tree}, and it sends you the million-entry
tree, too (and maybe some blobs?). You could tell it you already have
them, but you don't actually know what's in B^{tree} until you get it.
And advertising all of your trees and blobs is prohibitively expensive.

So I think that pushes us back towards wanting an "n+1" scheme, even if
the latency is bad. And that is really why partial clone is _so_ much easier
if you just resign yourself to giving the client all the commits and
trees. :)
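
The A/B example can be put into numbers with a small Python model (a
sketch only: the tree ids, the 1000-entry stand-in for the million-file
subtree, and both fetch functions are assumptions for illustration):

```python
def make_repo(n_files):
    """Two root trees, A and B, that share one large subtree."""
    big = {f"file{i}": ("blob", f"blob{i}") for i in range(n_files)}
    return {
        "big": big,
        "rootA": {"dir": ("tree", "big"), "README": ("blob", "readme-v2")},
        "rootB": {"dir": ("tree", "big"), "README": ("blob", "readme-v1")},
    }


def fetch_recursive(trees, tree_id):
    """Server sends the tree and every subtree below it, without knowing
    what the client already has."""
    sent = {tree_id: trees[tree_id]}
    for kind, oid in trees[tree_id].values():
        if kind == "tree":
            sent.update(fetch_recursive(trees, oid))
    return sent


def fetch_level_by_level(trees, tree_id, local):
    """Client asks for one level at a time and skips subtrees it has."""
    sent, round_trips, todo = {}, 0, [tree_id]
    while todo:
        batch = {t: trees[t] for t in todo}
        round_trips += 1
        sent.update(batch)
        todo = [oid for entries in batch.values()
                for kind, oid in entries.values()
                if kind == "tree" and oid not in local and oid not in sent]
    return sent, round_trips


def entries_sent(sent):
    """Total number of tree entries that crossed the wire."""
    return sum(len(entries) for entries in sent.values())


if __name__ == "__main__":
    trees = make_repo(1000)
    local = {"rootA", "big"}  # the client already has commit A's trees
    print(entries_sent(fetch_recursive(trees, "rootB")))
    sent, trips = fetch_level_by_level(trees, "rootB", local)
    print(entries_sent(sent), trips)
```

In this model the recursive fetch resends the whole shared subtree,
while the level-by-level client sees that it already has "big" as soon
as it reads B's root tree and stops after one round trip.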

-Peff