On Tue, May 2, 2017 at 7:21 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> On Mon, May 1, 2017 at 6:41 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>>
>>> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>>>> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>>>>
>>>>> Thanks for your comments. If you're referring to the codepath
>>>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>>>> index_fd or builtin/unpack-objects), that is fine because
>>>>> write_sha1_file() invokes freshen_packed_object() and
>>>>> freshen_loose_object() directly to check if the object already
>>>>> exists (and thus does not invoke the new mechanism in this patch).
>>>>
>>>> Is that a good thing, though? It means that an attacker can
>>>> feed one version to the remote object store your "grab blob" hook
>>>> gets the blobs from, and have you add a colliding object locally,
>>>> and the usual "are we recording the same object as existing one?"
>>>> check is bypassed.
>>>
>>> If I understand this correctly, what you mean is the situation
>>> where the hook adds an object to the local repo, overriding another
>>> object of the same name?
>>
>> No.
>>
>> write_sha1_file() pays attention to objects already in the local
>> object store to avoid hash collisions that can be used to replace a
>> known-to-be-good object, and that is done as a security measure.
>> What I am reading in your response was that this new mechanism
>> bypasses that, and I was wondering if that is a good thing.
>
> Oh, what I meant was that write_sha1_file() bypasses the new
> mechanism, not that the new mechanism bypasses the checks in
> write_sha1_file().
>
> To be clear, here's what happens when write_sha1_file() is invoked
> (before and after this patch - this patch does not affect
> write_sha1_file at all):
> 1. (some details omitted)
> 2. call freshen_packed_object
> 3. call freshen_loose_object if necessary
> 4. write object (if freshen_packed_object and freshen_loose_object
>    do not both return 0)
>
> Nothing changes in this patch (whether a hook is defined or not).

But don't the semantics change in the sense that before
core.missingBlobCommand you couldn't write a new blob SHA1 that was
already part of your history, whereas with this change
write_sha1_file() might write what it considers to be a new blob that
actually collides with an existing blob, and write_sha1_file() doesn't
know that because knowing would involve asking the hook to fetch the
blob?

> And here's what happens when has_sha1_file (or another function
> listed in the commit message) is invoked:
> 1. check for existence of packed object of the requested name
> 2. check for existence of loose object of the requested name
> 3. check again for existence of packed object of the requested name
> 4. if a hook is defined, invoke the hook and repeat 1-3
>
> Here, in step 4, the hook could do whatever it wants to the
> repository.

This might be a bit of early bikeshedding, but then again the lack of
early bikeshedding tends to turn into standards. Wouldn't it be better
to name this core.missingObjectCommand and have the hook take a list
on stdin like:

    <id> <TAB> <object_type> <TAB> <object_id> <TAB> <request_type> <TAB> [....]

And have the hook respond:

    <id> <TAB> <status> <TAB> [....]

I.e. what you'd do now is send this to the hook:

    1 <TAB> blob <TAB> <sha1> <TAB> missing

And the hook would respond:

    1 <TAB> ok

But this leaves the door open to addressing this potential edge case
with writing new blobs in the future, i.e. write_sha1_file() could
call it as:

    1 <TAB> blob <TAB> <sha1> <TAB> new

And the hook could either respond immediately with:

    1 <TAB> ok

if it's in some #YOLO mode where it's not going to check for colliding
blobs over the network, or alternatively ask the parent repo if it has
those blobs, and if so print:

    1 <TAB> collision

Or something like that.
This also enables future lazy loading of trees/commits from the same
API, and allows the hook to respond out of order to the input it gets
as results become available, since each request is prefixed with an
incrementing request id.
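To make that concrete, a hook speaking the proposed protocol could look
something like this minimal sketch (illustrative Python only, not part
of the patch; the fetch callback is a hypothetical stand-in for however
the hook would actually obtain or verify objects):

```python
import sys

def handle_requests(stdin=sys.stdin, stdout=sys.stdout, fetch=None):
    """Read requests of the form
    <id> TAB <object_type> TAB <object_id> TAB <request_type>
    and write one <id> TAB <status> response per request."""
    for line in stdin:
        req_id, obj_type, obj_id, req_type = line.rstrip("\n").split("\t")[:4]
        if req_type == "missing":
            # Try to obtain the missing object; report "ok" on success.
            ok = fetch is None or fetch(obj_type, obj_id)
            status = "ok" if ok else "error"
        elif req_type == "new":
            # A "#YOLO mode" hook can answer "ok" unconditionally; a
            # stricter one would ask the parent repo and answer
            # "collision" when the blob already exists there.
            status = "ok"
        else:
            status = "error"
        stdout.write("%s\t%s\n" % (req_id, status))
```

A real hook could buffer requests and emit responses as they complete
rather than strictly in input order, which is what the request id is
for.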
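And for reference, the lookup order Jonathan describes for
has_sha1_file() can be sketched like this (illustrative Python, not
git's actual C code; check_packed, check_loose and run_hook are
hypothetical stand-ins for the real lookups and the configured hook):

```python
def has_object(obj_id, check_packed, check_loose, run_hook=None):
    """Check packed, then loose, then packed again; if all fail and a
    hook is configured, run it once and repeat the three checks."""
    def local_checks():
        return bool(check_packed(obj_id) or check_loose(obj_id)
                    or check_packed(obj_id))
    if local_checks():
        return True
    if run_hook is not None:
        run_hook(obj_id)  # the hook may fetch the object into the repo
        return local_checks()
    return False
```

The point being that step 4 hands control to the hook, which can
mutate the repository before the checks are repeated.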