Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 14 Jul 2017 14:30:18 -0700

Jeff Hostetler wrote:
> On 7/13/2017 3:39 PM, Jonathan Tan wrote:

>> I know that discussion has shifted to the possibility of not having this
>> list at all, and not sending size information together with the fetch,
>> but going back to this...maybe omitting trees *is* the solution to both
>> the large local list and the large amount of size information needing to
>> be transferred.
>>
>> So the large-blob (e.g. Android) and many-blob (e.g. Windows) cases
>> would look like this:
>>
>>   * Large-blob repositories have no trees omitted and a few blobs
>>     omitted, and we have sizes for all of them.
>>   * Many-blob repositories have many trees omitted and either all
>>     blobs omitted (and we have size information for them, useful for FUSE
>>     or FUSE-like things, for example) or possibly no blobs omitted (for
>>     example, if shallow clones are going to be the norm, there won't be
>>     many blobs to begin with if trees are omitted).
>
> I'm not sure I understand what you're saying here.  Does omitting a tree
> object change the set of blob sizes we receive?  Are you saying that if
> we omit a tree, then we implicitly omit all the blobs it references and
> don't send size info for those blobs?  So that the local list only has
> reachable objects?  So faulting-in a tree would also have to send size
> info for the newly referenced blobs?
>
> Would this make it more similar to a shallow clone (in that none of the
> have_object tests work for items beyond the cut point)?

Correct.  After the server sends a promise instead of a tree object, the
client has no reason to try to access blobs pointed to by that tree, any
more than it has reason to try to access commits on a branch it has not
fetched.  This means the client does not have to be aware of those blobs
until it fetches the tree and associated blob promises.
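
To make that concrete, here is a rough Python sketch of which promises
the server would need to send under this scheme.  The object graph, the
object names, and promises_to_send() are all made up for illustration;
none of it is code from this series.

```python
from typing import Dict, List, Set

# Hypothetical toy object graph: object name -> names it references directly.
GRAPH: Dict[str, List[str]] = {
    "commit1": ["tree_root"],
    "tree_root": ["tree_src", "blob_readme"],
    "tree_src": ["blob_main", "blob_util"],
    "blob_readme": [],
    "blob_main": [],
    "blob_util": [],
}


def promises_to_send(graph: Dict[str, List[str]], sent: Set[str]) -> Set[str]:
    """Return the omitted objects that are referenced by at least one sent object.

    Objects that are only reachable through an omitted object get no
    promise, because the client cannot learn their names yet.
    """
    promised: Set[str] = set()
    for name in sent:
        for child in graph[name]:
            if child not in sent:
                promised.add(child)
    return promised


# Initial fetch: the server omits tree_src, so only tree_src is promised.
initial = promises_to_send(GRAPH, {"commit1", "tree_root", "blob_readme"})

# Faulting in tree_src: its blobs become visible and are promised in turn.
after_fault = promises_to_send(
    GRAPH, {"commit1", "tree_root", "blob_readme", "tree_src"}
)
```

In this toy example the first fetch carries a single promise (for
tree_src), and the promises for blob_main and blob_util only show up
once tree_src itself is fetched.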

[...]
> For the former case, if you just have a few omitted objects, then a
> second round-trip to mget their sizes isn't that much work.

For the client, that is true.  For the server, decreasing the number
of requests, even when requests are small and fast, can be valuable.

[...]
> I think for the latter, forcing a full promise-list on clone is just
> too much data to send -- data that we likely won't ever need.

What did you think of the suggestion to not send promises for objects
that are only referenced by objects that weren't sent?

[...]
>> What do you think of doing this:
>>   * add a "type" field to the list of promised objects (formerly the list
>>     of promised blobs)
>>   * retain mandatory size for blobs
>>   * retain single file containing list of promised objects (I don't feel
>>     too strongly about this, but it has a slight simplicity and
>>     in-between-GC performance advantage)
>
> The single promise-set is problematic.  I think it will grow too
> large (in our case) and will need all the usual lock juggling
> and merging.
>
> I still prefer my suggestion for a per-packfile promise-set for all
> of the reasons I stated the other day.  This can be computed quickly
> during index-pack, is (nearly) read-only, and doesn't require the
> whole file rewrite lock file.  It also has the benefit of being
> portable -- in that I can also copy the .promise file if I copy the
> .pack and .idx file to another repo.

Okay.
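
Just to check that I understand the shape of it, here is a rough Python
sketch of looking up promises from per-packfile files.  The
"pack-*.promise" naming, the "<oid> <type> [<size>]" line format, and
the function names are assumptions I made up for illustration; nothing
in this thread has settled on a format.

```python
from pathlib import Path
from typing import Dict, NamedTuple, Optional


class Promise(NamedTuple):
    obj_type: str
    size: Optional[int]  # in this sketch, recorded only when the line has one


def parse_promise_file(path: Path) -> Dict[str, Promise]:
    """Parse one hypothetical .promise file: '<oid> <type> [<size>]' per line."""
    promises: Dict[str, Promise] = {}
    for line in path.read_text().splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        oid, obj_type = fields[0], fields[1]
        size = int(fields[2]) if len(fields) > 2 else None
        promises[oid] = Promise(obj_type, size)
    return promises


def load_promises(pack_dir: Path) -> Dict[str, Promise]:
    """Union the promise-sets of every pack in pack_dir.

    Each file is only read, never rewritten, so no lock file is needed
    here; copying a pack together with its .promise file carries the
    promises along with it.
    """
    merged: Dict[str, Promise] = {}
    for path in sorted(pack_dir.glob("pack-*.promise")):
        merged.update(parse_promise_file(path))
    return merged
```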

Thanks,
Jonathan