Re: [PATCH v5 25/40] external-odb: add 'get_direct' support

Christian Couder <christian.couder@xxxxxxxxx> · Thu, 14 Sep 2017 10:39:35 +0200

On Thu, Aug 3, 2017 at 11:40 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
>> This implements the 'get_direct' capability/instruction that makes
>> it possible for external odb helper scripts to pass blobs to Git
>> by directly writing them as loose objects files.
>
> I am not sure if the assumption is made clear in this series, but I
> am (perhaps incorrectly) guessing that it is assumed that the
> intended use of this feature is to offload access to large blobs
> by not including them in the initial clone.

Yeah, it could be used for that, but that's not the only interesting use case.

It could also be used, for example, if the working tree contains a huge
number of blobs and it is better to download only the blobs that are
needed, when they are needed. In fact the code for 'get_direct' was
taken from Ben Peart's "read-object" patch series (actually from an
earlier version of that patch series):

https://public-inbox.org/git/20170714132651.170708-1-benpeart@xxxxxxxxxxxxx/
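
To make "directly writing them as loose object files" a bit more
concrete, here is a rough sketch in Python of what a helper would have
to do for one blob. It is not the code from the patch or from Ben's
series; the function name write_loose_blob and its arguments are made
up for illustration. It only relies on the usual loose object layout
(zlib-compressed "blob <size>\0<content>", stored under the first two
hex digits of the sha1):

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_blob(objects_dir: str, data: bytes) -> str:
    """Store data as a loose blob under objects_dir; return its SHA-1 hex id.

    Hypothetical helper, for illustration only.
    """
    # A loose object is the zlib-compressed "<type> <size>\0<content>".
    store = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    oid = hashlib.sha1(store).hexdigest()

    subdir = os.path.join(objects_dir, oid[:2])
    path = os.path.join(subdir, oid[2:])
    if os.path.exists(path):
        return oid  # already present, nothing to do

    os.makedirs(subdir, exist_ok=True)
    # Write to a temporary file and rename it into place, so that a
    # reader never sees a partially written object.
    fd, tmp = tempfile.mkstemp(dir=subdir)
    with os.fdopen(fd, "wb") as f:
        f.write(zlib.compress(store))
    os.replace(tmp, path)
    return oid
```

With something like this, the helper only has to report success, and
Git can then read the object from its own object directory as usual.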

> So from that point of
> view, I think it makes tons of sense to let the external helper to
> directly populate the database bypassing Git (i.e. instead of
> feeding data stream and have Git store it) like this "direct" method
> does.
>
> How does this compare with (and how well does this work with) what
> Jonathan Tan is doing recently?

From the following email:

https://public-inbox.org/git/20170804145113.5ceafafa@xxxxxxxxxxxxxxxxxxxxxxxxxxx/

it looks like his work is fundamentally about changing the rules of
connectivity checks. Objects are split between "homegrown" objects and
"imported" objects, which are in separate pack files. Then references
to imported objects are not checked during the connectivity check.

I think changing connectivity rules is not necessary to make something
like external odb work. For example, when fetching a pack that refers
to objects that are in an external odb, if access to this external odb
has been configured, then the connectivity check will pass, as the
objects missing from the pack will be seen as already part of the repo.
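
As a toy model of that argument (this is not how Git implements the
check; missing_objects and its parameters are hypothetical names), the
set of objects considered present is simply the union of the local
objects and whatever the configured external odb advertises:

```python
from typing import Iterable, List, Optional


def missing_objects(referenced: Iterable[str],
                    local: Iterable[str],
                    external: Optional[Iterable[str]] = None) -> List[str]:
    """Return the referenced object ids that no configured odb can provide."""
    available = set(local)
    if external is not None:
        # A configured external odb counts as part of the repo.
        available |= set(external)
    return sorted(oid for oid in set(referenced) if oid not in available)
```

Without the external odb, the referenced blob shows up as missing and
the check fails; with it, the list is empty and the check passes
without any change to the connectivity rules themselves.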

Yeah, if some commands like fsck are used, then possibly all the
objects will have to be requested from the external odb, as it may not
be possible to fully check all the objects, especially the blobs,
without accessing all their data. But I think this is a problem that
could be dealt with in different ways. For example we could develop
specific options in fsck so that it doesn't check the sha1 of objects
that are marked with some specific attributes, or that are stored in
external odbs, or that are bigger than some size.
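
To illustrate the kind of fsck option I have in mind, here is a small
sketch in Python. It is only an assumption about how such a policy
could look, not an existing fsck feature: fsck_object, in_external_odb
and max_size are made-up names, and it assumes the object size is known
from some metadata without fetching the content. It covers the
"external odb" and "bigger than some size" cases, not the attributes
one:

```python
import hashlib
from typing import Callable, Optional


def fsck_object(oid: str,
                obj_type: str,
                size: int,
                load: Callable[[], bytes],
                *,
                in_external_odb: bool = False,
                max_size: Optional[int] = None) -> str:
    """Check one object's sha1 unless the policy says to skip it.

    Returns "ok", "corrupt" or "skipped". load() is only called when
    the content is really needed, so skipped objects are never fetched.
    """
    if in_external_odb:
        return "skipped"
    if max_size is not None and size > max_size:
        return "skipped"

    data = load()
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    if hashlib.sha1(header + data).hexdigest() != oid:
        return "corrupt"
    return "ok"
```

The important part is that the decision to skip is made before the
content is requested, so fsck would not have to pull every object from
the external odb.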