Re: Partial clone design (with connectivity check for locally-created objects)

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 7 Aug 2017 16:10:31 -0700

On Mon, 7 Aug 2017 15:12:11 -0400
Ben Peart <peartben@xxxxxxxxx> wrote:

> I missed the offline discussion and so am trying to piece together what 
> this latest design is trying to do.  Please let me know if I'm not 
> understanding something correctly.
> 
>  From what I can tell, objects are going to be segmented into two 
> "types" - those that were fetched from a remote source that allows 
> partial clones/fetches (lazyobject/imported) and those that come from 
> "regular" remote sources (homegrown) that requires all objects to exist 
> locally.
> 
> FWIW, the names here are not making things clearer for me. If I'm 
> correct perhaps "partial" and "normal" would be better to indicate the 
> type of the source? Anyway...

That's right. As for names, I'm leaning now towards "imported" and
"non-imported". "Partial" is a bit strange because such an object is
fully available; it's just that the objects that it references are
promised by the server.

> Once the objects are segmented into the 2 types, the fsck connectivity 
> check code is updated to ignore missing objects from "partial" remotes 
> but still expect/validate them from "normal" remotes.
> 
> This compromise seems reasonable - don't generate errors for missing 
> objects for remotes that returned a partial clone but do generate errors 
> for missing objects from normal clones as a missing object is always an 
> error in this case.

Yes. In addition, the references of "imported" objects are also
potentially used when connectivity-checking "non-imported" objects - if
a "non-imported" object refers to an object that an "imported" object
refers to, that is fine, even though we don't have that object.

> This segmentation is what is driving the need for the object loader to 
> build a new local pack file for every command that has to fetch a 
> missing object.  For example, we can't just write a tree object from a 
> "partial" clone into the loose object store as we have no way for fsck 
> to treat them differently and ignore any missing objects referenced by 
> that tree object.
> 
> My concern with this proposal is the combination of 1) writing a new 
> pack file for every git command that ends up bringing down a missing 
> object and 2) gc not compressing those pack files into a single pack file.
> 
> We all know that git doesn't scale well with a lot of pack files as it 
> has to do a linear search through all the pack files when attempting to 
> find an object.  I can see that very quickly, there would be a lot of 
> pack files generated and with gc ignoring "partial" pack files, this 
> would never get corrected.
> 
> In our usage scenarios, _all_ of the objects come from "partial" clones 
> so all of our objects would end up in a series of "partial" pack files 
> and would have pretty poor performance as a result.

One possible solution...would support for annotating loose objects with
".remote" be sufficient? (That is, for each loose object file created,
create another of the same name but with ".remote" appended.) This means
that a loose-object-creating lazy loader would need to create 2 files
per object instead of one.

The lazy loader protocol will thus be updated to something resembling a
prior version with the loader writing objects directly to the object
database, but now the loader is also responsible for creating the
".remote" files.  (In the Android use case, we probably won't need the
writing-to-partial-packfile mechanism anymore since only comparatively
few and large blobs will go in there.)