Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Fri, 21 Jul 2017 13:33:08 -0700

On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart <peartben@xxxxxxxxx> wrote:

> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
> promises. Given the average developer only hydrates 56K files (2 MB 
> promises) that is 103 MB to download that no one will ever need. We 
> would like to avoid that if possible as this would be a significant 
> regression in clone times from where we are today.
> 
> I'm also concerned about the performance of merging in promises given we 
> have 100M objects today and growing so the number of promises over time 
> could get pretty large.

After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.

This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.

> >> I think we should have a flag (off by default) that enables someone to
> >> say that promised objects are optional. If the flag is set,
> >> "is_promised_object" will return success and pass the OBJ_ANY type and a
> >> size of -1.
> >>
> >> Nothing today is using the size and in the two places where the object
> >> type is being checked for consistency (fsck_cache_tree and
> >> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> >>
> >> This will enable very large numbers of objects to be omitted from the
> >> clone without triggering a download of the corresponding number of
> >> promised objects.
> > 
> > Eventually I plan to use the size when implementing parameters for
> > history-searching commands (e.g. "git log -S"), but it's true that
> > that's in the future.
> > 
> > Allowing promised objects to be optional would indeed solve the issue of
> > downloading too many promises. It would make the code more complicated,
> > but I'm not sure by how much.
> > 
> > For example, in this fsck patch, the easiest way I could think of to
> > have promised objects was to introduce a 3rd state, called "promised",
> > of "struct object" - one in which the type is known, but we don't have
> > access to the full "struct commit" or equivalent. And thus fsck could
> > assume that if the "struct object" is "parsed" or "promised", the type
> > is known. Having optional promised objects would require that we let
> > this "promised" state have a type of OBJ_UNKNOWN (or something like
> > that) - maybe that would be fine, but I haven't looked into this in
> > detail.
> > 
> 
> Caveats apply as I only did a quick look but I only found the two 
> locations that were checking the object type for consistency.

I haven't looked into detail, but you are probably right.