Re: Fetch-hooks

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 09 Feb 2018 23:56:21 +0100

On Fri, Feb 09 2018, Leo Gaspard jotted:

> On 02/09/2018 11:04 PM, Ævar Arnfjörð Bjarmason wrote:>>> You could also
> have some intermediate step between these two, where
>>>> e.g. your refspec for "origin" is
>>>> "+refs/heads/*:refs/remotes/origin-untrusted/*" instead of the default
>>>> "+refs/heads/*:refs/remotes/origin/*", you fetch all refs to that
>>>> location, then you move them (with some alias/hook) to
>>>> "refs/remotes/origin/*" once they're seen to be "OK".
>>>
>>> That is indeed another possibility, but then the idea is to make things
>>> as transparent as possible for the end-user, not to completely change
>>> their git workflow. As such, considering only signed commits to be part
>>> of the upstream seems to make sense to me?
>>
>> I mean this would be something that would be part of a post-fetch hook,
>> so it would be as transparent as what you're doing to the user, with the
>> difference that it doesn't need to involve changes to what you slurp
>> down from the server.
>>
>> I.e. we'd just fetch into refs/remotes/origin-untrusted/, then we run
>> your post-fetch hook and you go over the new refs, and copy what you
>> like (usually everything) to refs/remotes/origin/*.
>
> Hmm... but that would then require a post-fetch hook, wouldn't it? And
> about a post-fetch hook, if I understood correctly, Junio in [1] had a
> quite nice argument against it:
>
>     Although I do not deeply care between such a "trigger to only
>     notify, no touching" hook and a full-blown "allow hook writers to
>     easily lie about what happened in the fetch" hook, I was hoping that
>     we would get this right and useful if we were spending our brain
>     bandwidth on it. I am not very fond of an easier "trigger to only
>     notify" hook because people are bound to misuse the interface and
>     try updating the refs anyway, making it easy to introduce
>     inconsistencies between refs and FETCH_HEAD that will confuse the
>     later "merge" step.
>
> Otherwise, if it doesn't require a post-fetch hook, then it would
> require the end-user to first fetch, then run the
> `copy-trusted-refs-over` script, which would add stuff to the user's
> workflow.
>
> Did I miss another possibility?

Yes, the assumption is that there would be a post-fetch hook of some
sort to ferry the refs over from the quarantine.

My reading of that thread is not that Junio's outright against such a
facility (and also it was 2011 and the discussion can be re-visited),
but rather that there's specific concerns that need to be kept in mind
so reliable things involving the ref store don't become brittle as a
result of unstable user hooks, or us offering an interface that's easily
misused.

>> [...]
>>
>> One thing that's not discussed yet, and I know just enough about for it
>> to tingle my spidey sense, but not enough to say for sure (CC'd Jeff &
>> Brandon who know more) is that this feature once shipped might cause
>> higher load on git hosting providers.
>>
>> This is because people will inevitably use it in popular projects for
>> some custom filtering, and because you're continually re-fetching and
>> inspecting stuff what used to be a really cheap no-op "pull" most of the
>> time is a more expensive negotiation every time before the client
>> rejects the refs again, and worse for hosting providers because you have
>> bespoke ref fetching strategies you have less odds of being able to
>> cache both the negotiation and the pack you serve.
>>
>> I.e. you want this for some security feature where 99.99% of the time
>> you accept all refs, but most people will probably use this to implement
>> dynamic Turing-complete refspecs.
>>
>> Maybe that's worrying about nothing, but worth thinking about.
>
> Well... First, I must say I didn't really understand your last paragraph
> about Turing-complete refspecs.
>
> But my understanding of how the fetch-hook patchset I sent this evening
> works is that it first receives all the objects from the hosting
> provider, then locally moves the refs, but never actually discards the
> downloaded objects (well, until a `git gc` I guess).
>
> So I don't think the network traffic with the provider would be any
> different wrt. what it is now, even if a tweak-fetch hook rejects some
> commits? Then again I don't know git's internals enough to be have even
> a bit of certainty about what I'm saying right now, so...
>
>
> [1] https://marc.info/?l=git&m=132480559712592&w=2

As Jeff notes in the side-thread in
<20180209223011.GA24578@xxxxxxxxxxxxxxxxxxxxx> when you do a "fetch" the
protocol is not negotiating on the basis of what loose unreferenced
(because you threw away the ref!) objects you have already, but what
the ref commonality is between you and the server (and this is just on
a best-effort basis).

So for example, let's say you only accept the "master" branch and have a
hook that's refusing updates to the "dev" branch, fetch is going to be
re-negotiating fetching the difference between the two over the wire
every time, only to have the hook say "no thanks" and throw away the
result.

I.e. as an extreme example, if the dev branch is adding a 1GB file
divergent from master, you are going to be re-downloading that 1GB every
time you fetch, even though you have that blob locally, the remote can't
know that, and because you have no ref pointing to it you can't mention
it during the negotiation.

Which is way less efficient than the case where your refspec only
fetches "master", since then we won't try at all, or the case where you
fetch both master & dev, but into some quarantine area and are just
deciding to update a local ref or not.