On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700
> Junio C Hamano <gitster@xxxxxxxxx> wrote:
> > Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).
...
A "homegrown" object is valid if each object it references:
1. is a "homegrown" object,
2. is an "imported" object, or
3. is referenced by an "imported" object.
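To check my reading of the marker convention, here is a minimal sketch
of tagging a pack as "imported", assuming the ".remote" file sits next
to the pack exactly as a ".keep" file would (the suffix comes from the
proposal; current git does not act on it):

    # Mark an existing pack as "imported" the way a .keep file marks a
    # kept pack; per the proposal the file's text is arbitrary.
    pack=$(ls .git/objects/pack/pack-*.pack | head -n 1)
    echo "imported from origin" >"${pack%.pack}.remote"
    ls "${pack%.pack}".*   # shows .pack, .idx, and the new .remote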
> > Overall it captures what was discussed, and I think it is a good
> > start.
I missed the offline discussion and so am trying to piece together what
this latest design is trying to do. Please let me know if I'm not
understanding something correctly.
From what I can tell, objects are going to be segmented into two
"types": those that were fetched from a remote source that allows
partial clones/fetches (lazyobject/imported) and those that come from
"regular" remote sources (homegrown) that require all objects to exist
locally.
FWIW, the names here are not making things clearer for me. If I'm
correct, perhaps "partial" and "normal" would be better to indicate the
type of the source? Anyway...
Once the objects are segmented into the two types, the fsck
connectivity check code is updated to ignore missing objects from
"partial" remotes but still expect/validate them from "normal" remotes.
This compromise seems reasonable: don't generate errors for missing
objects from remotes that returned a partial clone, but do generate
errors for missing objects from normal clones, as a missing object is
always an error in that case.
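As a baseline for the "normal" side of that compromise, a quick demo
against today's git (no proposal changes assumed) shows that fsck
already flags any missing reachable object:

    # Deleting any reachable object makes fsck complain today; the
    # proposal would suppress this only for objects backed by an
    # "imported" pack.
    git init -q demo && cd demo
    echo hi >file && git add file && git commit -q -m one
    blob=$(git rev-parse :file)
    rm .git/objects/$(echo "$blob" | cut -c1-2)/$(echo "$blob" | cut -c3-)
    git fsck   # reports "missing blob" for the deleted object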
This segmentation is what is driving the need for the object loader to
build a new local pack file for every command that has to fetch a
missing object. For example, we can't just write a tree object from a
"partial" clone into the loose object store, as we would have no way
for fsck to treat it differently and ignore any missing objects
referenced by that tree object.
My concern with this proposal is the combination of 1) writing a new
pack file for every git command that ends up bringing down a missing
object and 2) gc not compressing those pack files into a single pack file.
We all know that git doesn't scale well with a lot of pack files, as it
has to do a linear search through all the pack files when attempting to
find an object. I can see that, very quickly, there would be a lot of
pack files generated, and with gc ignoring "partial" pack files, this
would never get corrected.
In our usage scenarios, _all_ of the objects come from "partial" clones
so all of our objects would end up in a series of "partial" pack files
and would have pretty poor performance as a result.
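To make the accumulation concern concrete, here is a rough model using
today's incremental repack as a stand-in for the loader writing one
pack per fetch (the loader itself doesn't exist yet, so this only
demonstrates the pack-count growth):

    # In a test repo: each incremental repack adds another pack, and
    # object lookup probes every pack's .idx in turn. gc normally
    # consolidates packs, but the proposal exempts ".remote" packs.
    for i in 1 2 3
    do
        git commit -q --allow-empty -m "c$i"
        git repack -q        # incremental: leaves existing packs alone
    done
    ls .git/objects/pack/*.pack | wc -l   # grows by one per repack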
I wondered if it is possible to flag a specific remote as "partial" and
have fsck be able to track any given object back to the remote and then
properly handle the fact that it was missing based on that. I couldn't
think of a good way to do that without some additional data structure
that would have to be built/maintained (i.e. promises).
That thinking did lead me back to wondering again if we could live with
a repo-specific flag. If any clone/fetch was "partial", the flag is set
and fsck ignores missing objects whether they came from a "partial"
remote or not.
I'll admit it isn't as robust if someone is mixing and matching remotes
from different servers, some of which are partial and some of which are
not. I'm not sure how often that would actually happen, but I _am_
certain a single repo-specific flag is a _much_ simpler model than
anything else we've come up with so far.
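If we went that way, the knob could be as small as a single bit of
config. The name below is made up purely for illustration; no such
setting exists today:

    # Hypothetical repo-wide flag (illustrative name only): any partial
    # clone/fetch would set it, and fsck would then tolerate missing
    # objects no matter which remote they came from.
    git config core.partialRepository true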
> > I doubt you want to treat all fetches/clones the same way as the
> > "lazy object" loading, though. You may critically rely on the
> > corporate central server that will give the objects it "promised"
> > when you cloned from it lazily (i.e. it may have given you a commit,
> > but not its parents or objects contained in its tree--you still know
> > that the parents and the tree and its contents will later be
> > available and rely on that fact). You trust that and build on top,
> > so the packfile you obtained when you cloned from such a server
> > should count as "imported". But if you exchanged wip changes with
> > your colleagues by fetching or pushing peer-to-peer, without the
> > corporate central server knowing, you would want to treat objects in
> > packs (or loose objects) you obtained that way as "not imported".
> That's true. I discussed this with a teammate and we might need to make
> extensions.lazyObject be the name of the "corporate central server"
> remote instead, and have a "loader" setting within that remote, so that
> we can distinguish that objects from this server are "imported" but
> objects from other servers are not.
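If I follow this, the settings might end up shaped something like the
following; both key names and the loader path are illustrative, since
none of this is implemented yet:

    # Sketch of the proposed config: extensions.lazyObject names the
    # trusted remote, and that remote carries a "loader" command used
    # to fetch missing objects on demand.
    git config extensions.lazyObject origin
    git config remote.origin.loader "/path/to/object-loader"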
> The connectivity check shouldn't be slow in this case because fetches
> are usually onto tips that we have (so we don't hit case 3).
> > Also I think "imported" vs "homegrown" may be a bit of a misnomer;
> > the idea to split objects into two camps sounds like a good idea,
> > and "imported" probably is an OK name to use for the category that
> > is a group of objects which you know/trust to be backed by your lazy
> > loader. But the other one does not have to be "home"-grown.
> > Well, the names are not that important, but I think the line between
> > the two classes should not be "everything that came from clone and
> > fetch is imported", which is a more important point I am trying to
> > make.
> > Thanks.
Maybe "imported" vs "non-imported" would be better. I agree that the
objects in the non-"imported" group could still be obtained from
elsewhere.
Thanks for your comments.