Re: Proposal for missing blob support in Git repos

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 2 May 2017 20:32:05 +0200

On Tue, May 2, 2017 at 7:21 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> On Mon, May 1, 2017 at 6:41 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>>
>>> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>>>> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>>>>
>>>>> Thanks for your comments. If you're referring to the codepath
>>>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>>>> index_fd or builtin/unpack-objects), that is fine because
>>>>> write_sha1_file() invokes freshen_packed_object() and
>>>>> freshen_loose_object() directly to check if the object already exists
>>>>> (and thus does not invoke the new mechanism in this patch).
>>>>
>>>> Is that a good thing, though?  It means that you an attacker can
>>>> feed one version to the remote object store your "grab blob" hook
>>>> gets the blobs from, and have you add a colliding object locally,
>>>> and the usual "are we recording the same object as existing one?"
>>>> check is bypassed.
>>>
>>> If I understand this correctly, what you mean is the situation where
>>> the hook adds an object to the local repo, overriding another object
>>> of the same name?
>>
>> No.
>>
>> write_sha1_file() pays attention to objects already in the local
>> object store to avoid hash collisions that can be used to replace a
>> known-to-be-good object and that is done as a security measure.
>> What I am reading in your response was that this new mechanism
>> bypasses that, and I was wondering if that is a good thing.
>
> Oh, what I meant was that write_sha1_file() bypasses the new
> mechanism, not that the new mechanism bypasses the checks in
> write_sha1_file().
>
> To be clear, here's what happens when write_sha1_file() is invoked
> (before and after this patch - this patch does not affect
> write_sha1_file at all):
> 1. (some details omitted)
> 2. call freshen_packed_object
> 3, call freshen_loose_object if necessary
> 4. write object (if freshen_packed_object and freshen_loose_object do
> not both return 0)
>
> Nothing changes in this patch (whether a hook is defined or not).

But don't the semantics change in the sense that before
core.missingBlobCommand you couldn't write a new blob SHA1 that was
already part of your history, whereas with this change
write_sha1_file() might write what it considers to be a new blob, but
it's actually colliding with an existing blob, but write_sha1_file()
doesn't know that because knowing would involve asking the hook to
fetch the blob?

> And here's what happens when has_sha1_file (or another function listed
> in the commit message) is invoked:
> 1. check for existence of packed object of the requested name
> 2. check for existence of loose object of the requested name
> 3. check again for existence of packed object of the requested name
> 4. if a hook is defined, invoke the hook and repeat 1-3
>
> Here, in step 4, the hook could do whatever it wants to the repository.

This might be a bit of early bikeshedding, but then again the lack of
early bikeshedding tends to turn into standards.

Wouldn't it be better to name this core.missingObjectCommand & have
the hook take a list on stdin like:

    <id> <TAB> <object_type> <TAB> <object_id> <TAB> <request_type> <TAB> [....]

And have the hook respond:

    <id> <TAB> <status> <TAB> [....]

I.e. what you'd do now is send this to the hook:

    1 <TAB> blob <TAB> <sha1> <TAB> missing

And the hook would respond:

    1 <TAB> ok

But this leaves open the door addressing this potential edge case with
writing new blobs in the future, i.e. write_sha1_file() could call it
as:

    1 <TAB> blob <TAB> <sha1> <TAB> new

And the hook could either respond immediately as:

    1 <TAB> ok

If it's in some #YOLO mode where it's not going to check for colliding
blobs over the network, or alternatively & ask the parent repo if it
has those blobs, and if so print:

    1 <TAB> collision

Or something like that.

This also enables future lazy loading of trees/commits from the same
API, and for the hook to respond out-of-order to the input it gets as
it can, since each request is prefixed with an incrementing request
id.