On Thu, Sep 01, 2011 at 03:42:46PM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
> > I think the breakages are:
> >
> >  - The sending side does not give any indication that it _wanted_ to
> >    send ce0136 but couldn't, and ended up sending another object;

This isn't fixable without the server re-hashing every outgoing object,
though, is it? It thinks it has ce0136, but the data is corrupted. The
only way to detect that is to rehash each object that we send. That's
pretty expensive when you consider decompression and delta
reconstruction.

Besides which, it doesn't help against many malicious cases at all. It
will catch disk corruption or a malicious object-db tweak, but it does
nothing against an attacker who can trojan git, or who can spoof the
network session (which is probably hard to do if it's straight ssh, but
keep in mind that traffic may be going through cleartext proxies).

So it seems that the receiving side is the only sensible place to put
such checks. It catches problems at more levels, and it can generally
afford the extra processing load (under the assumption that pushes to
servers are smaller and less frequent than fetches).

> >  - The pack data sent over the wire was self consistent (no breakage
> >    here) and sent three well-formed objects, but it was inconsistent
> >    with respect to what history was being transferred (breakage is
> >    here);

I don't think this matters. We have thin packs, after all. What goes on
the wire doesn't have to be sensible unto itself; what matters is
whether the receiving end can do something useful with it.

> >  - The receiving end did not notice the inconsistency.
> >
> > The first one is of the lower priority, as the client side should be
> > able to notice an upstream with corruption in any case. Perhaps after
> > asking for objects between "have" and "want", "git fetch" should
> > verify that it can fully walk the subhistory that was supposed to be
> > transferred down to the blob level?

That seems like a sensible improvement to me.

> So I have a series to fix the latter "more important" half I'll be
> sending out in this thread.

If I understand correctly, your series is just about checking that we
have newly-referenced blobs. We were already checking commits and trees,
and we should already be hashing individual objects when we index the
pack. Right? In that case, the performance change should be negligible.

-Peff
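
To make "rehash each object" concrete: an object's name is the SHA-1 of
its uncompressed "type size\0body" stream, so verification just means
recomputing that hash and comparing it to the name. Below is a rough
Python sketch of the loose-object case only (packed objects would first
need delta reconstruction, which is where the expense mentioned above
comes from); verify_loose_object and its arguments are made up for
illustration and are not part of git itself.

  import hashlib, os, sys, zlib

  def verify_loose_object(objects_dir, oid):
      # Standard fan-out layout: .git/objects/ab/cdef...
      path = os.path.join(objects_dir, oid[:2], oid[2:])
      with open(path, "rb") as f:
          # Loose objects are a zlib stream over "type size\0body".
          data = zlib.decompress(f.read())
      # The object is intact iff its content still hashes to its name.
      return hashlib.sha1(data).hexdigest() == oid

  if __name__ == "__main__":
      objects_dir, oid = sys.argv[1], sys.argv[2]
      print("ok" if verify_loose_object(objects_dir, oid) else "CORRUPT", oid)

The receiving side already pays this cost anyway: index-pack has to
inflate and hash every incoming object to build the pack index, which is
why checking there is close to free, while the sender would be
re-inflating (and re-resolving deltas for) data it already trusts.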