Re: Re-Transmission of blobs?

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 10 Sep 2013 10:51:02 -0700

Josef Wolf <jw@xxxxxxxxxxxxx> writes:

> as we all know, files are identified by their SHA. Thus I had the impression
> that when transfering files, git would know by the SHA whether a given file is
> already available in the destination repository and the transfer would be of
> no use.

That is unfortunately not how things work.  It is not like the
receiving end sends the names of all objects it has, and the sending
end excludes these objects from what it is going to send.

Consider this simple history with only a handful of commits (as
usual, time flows from left to right):

              E
             /   
    A---B---C---D

where D is at the tip of the sending side, E is at the tip of the
receiving side.  The exchange goes roughly like this:

    (receiving side): what do you have?

    (sending side): my tip is at D.

    (receiving side): D?  I've never heard of it --- please give it
                      to me.  I have E.

    (sending side): E?  I don't know about it; must be something you
                    created since you forked from me.  Tell me about
                    its ancestors.

    (receiving side): OK, I have C.

    (sending side): Oh, C I know about. You do not have to tell me
                    anything more.  A packfile to bring you up to
                    date will follow.

At this point, the sender knows that the receiver needs the commit
D, and trees and blobs in D.  It does also know it has the commit C
and trees and blobs in C.  It does the best thing it can do using
these (and only these) information, namely, to send the commit D,
and send trees and blobs in D that are not in the commit C.

You may happen to have something in E that match what is in D but
not in C.  Because the sender does not know anything about E at all
in the first place, that information cannot be used to reduce the
transfer.

The sender theoretically _could_ also exploit the fact that any
receiver that has C must have B and A and all trees and blobs
associated with these ancestor commits [*1*], but that information
is not currently discovered nor used during the object transfer.

There may happen to be a tree or a blob in A that matches a tree or
a blob in D.  But because the common ancestor discovery exchange
above stops at C, the sender does not bother enumerating all the
objects that are in the ancestor commits of C when figuring out what
objects to send to ensure that the receiving end has all the objects
necessary to complete D.  If you modified a blob at B (or C) and
then resurrected the old version of the blob at D, it is likely that
the blob is going to be sent again when the receiving end asks for
D.

There are some work being done to optimize this further using
various techniques, but they are not ready yet.

[Footnote]

*1* only down to the shallow boundary, if the receiving end is a
shallow clone.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html