On 7/21/2017 4:33 PM, Jonathan Tan wrote:
On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart <peartben@xxxxxxxxx> wrote:
Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
promises. Given the average developer only hydrates 56K files (about 2 MB
of promises), that leaves roughly 103 MB to download that no one will
ever need. We would like to avoid that if possible, as it would be a
significant regression in clone times from where we are today.
I'm also concerned about the performance of merging in promises: we
have 100M objects today and growing, so the number of promises could
get quite large over time.
After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.
In our GVFS solution today we do not download any size or object type
information at clone as the number of objects and the resulting file
would be too large. Instead, we have a new sizes endpoint
(https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables
us to retrieve object sizes "on demand" much like we are enabling for
the actual object content.
This protocol could easily be extended to return both size and type so
that it could be used to retrieve "promise" data for objects as they are
needed. Having a way to "cache" that data locally so that both git and
other code could share it would be great.
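For illustration, the sizes endpoint takes a batch of object ids and
returns their sizes; extended as proposed, a response entry might also
carry a type. (The field names below are illustrative and should be
checked against Protocol.md; the "type" field is the proposed
extension, not part of today's protocol. The object id is a made-up
placeholder.)

```json
[
  { "objectId": "5ad5...", "size": 12345, "type": "blob" }
]
```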
At a minimum, we should ensure the data stream passed back is the same
whether it is retrieved at clone time or from a "promises" endpoint. I
think it would also be helpful to enable promises to be downloaded on
demand, much like we are doing for the object data itself.
This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)
Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.
I think we should have a flag (off by default) that enables someone to
say that promised objects are optional. If the flag is set,
"is_promised_object" will return success, passing back the OBJ_ANY type
and a size of -1.
Nothing today is using the size, and in the two places where the object
type is checked for consistency (fsck_cache_tree and fsck_handle_ref),
the check can be extended to accept OBJ_ANY as well.
This will enable very large numbers of objects to be omitted from the
clone without triggering a download of the corresponding number of
promised objects.
Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.
Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.
For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.
Caveats apply since this was only a quick look, but those were the only
two locations I found that check the object type for consistency.
I haven't looked into it in detail, but you are probably right.