Re: Questions about git-push for huge repositories

Jeff King <peff@xxxxxxxx> · Tue, 8 Sep 2015 01:00:27 -0400

On Mon, Sep 07, 2015 at 09:05:41AM +0800, Levin Du wrote:

> > Instead, the object transfer is optimized by comparing what commits
> > each side has and sending trees and blobs that are reachable from
> > the commits that the receiving side does not have.
> 
> The sender A sends all the commits that the receiver B does not have.
> The commits contain trees and blobs. In my situation, the branch in A
> has only one commit. It seems that B has received lots of duplicate
> blobs, judging from the GC result.

Right. B tells A "I already have this commit", but A does not have that
commit itself, so the information does not help. A cannot make any
assumptions about what B has, and must send all trees and blobs
referenced by its own commit.
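
To make that concrete, here is a rough toy model in Python. It is only a
sketch of the idea, not git's actual negotiation code: the function names,
the dictionary layout and the object ids are all made up for illustration.
Each commit maps to its parents and to the set of trees and blobs it
references.

```python
# Toy model of push negotiation; NOT git's actual implementation.
# All names and ids here are hypothetical.
# commits: {commit_id: (list_of_parent_ids, set_of_tree_and_blob_ids)}

def reachable(commits, tips):
    """Return every commit reachable from `tips` that exists in `commits`."""
    seen = set()
    stack = [t for t in tips if t in commits]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        parents, _ = commits[c]
        stack.extend(p for p in parents if p in commits)
    return seen

def objects_to_send(sender_commits, push_tip, receiver_tips):
    # Receiver tips the sender has never seen are silently ignored:
    # they tell the sender nothing about which trees and blobs exist there.
    common = reachable(sender_commits, receiver_tips)
    missing = reachable(sender_commits, [push_tip]) - common
    have = set()
    for c in common:
        have |= sender_commits[c][1]
    need = set()
    for c in missing:
        need |= sender_commits[c][1]
    return need - have

# A has a single root commit; B advertises a commit that A does not know.
a_commits = {"a1": ([], {"tree1", "blob1", "blob2"})}
print(sorted(objects_to_send(a_commits, "a1", ["b9"])))
# → ['blob1', 'blob2', 'tree1']
```

With no commit in common, nothing can be subtracted, so everything the
commit references goes over the wire, even if B happens to hold identical
blobs already.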

> What I do not understand is, how can duplicate blobs exist in a git
> repository? Git is famous for its content-addressable storage system.
> I guess that A sends its packed file to B directly, regardless of what
> is already in B.

Not exactly.  During a push, git may or may not keep the packfile sent
over the wire, depending on the number of objects in it and the
receive.unpackLimit config setting. The same object can exist in two
separate packfiles. One of the effects of "git gc" is to remove such
duplicates.
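
Roughly, and again only as a toy model rather than git's real code, the
receiving side behaves like the sketch below. The class and method names
are invented, and the limit of 100 is an assumption (I believe it is the
default, but check your config). Below the limit, objects are unpacked
into loose objects; at or above it, the incoming pack is kept as-is.

```python
# Toy model of the receiving side of a push; NOT git's actual implementation.
UNPACK_LIMIT = 100  # assumed value of receive.unpackLimit

class Repo:
    def __init__(self):
        self.loose = set()  # ids stored as loose objects
        self.packs = []     # each pack is modeled as a set of object ids

    def all_objects(self):
        return set(self.loose).union(*self.packs)

    def receive(self, objects, unpack_limit=UNPACK_LIMIT):
        objects = set(objects)
        if len(objects) < unpack_limit:
            # Unpacked one by one; in this model, objects already present
            # are not written again.
            self.loose |= objects - self.all_objects()
        else:
            # Pack is kept as received, even if it overlaps existing packs.
            self.packs.append(objects)

    def stored_copies(self):
        return len(self.loose) + sum(len(p) for p in self.packs)

    def gc(self):
        # Repack everything into one pack; duplicates disappear.
        everything = self.all_objects()
        self.loose = set()
        self.packs = [everything]

b = Repo()
b.packs.append({f"obj{i}" for i in range(500)})  # B's existing history
b.receive({f"obj{i}" for i in range(500)})       # same objects pushed again
print(b.stored_copies())  # → 1000
b.gc()
print(b.stored_copies())  # → 500
```

The object ids are still unique; it is only the on-disk copies that are
duplicated across packs until the repack.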

So A effectively does send its whole pack in this case, but only because
it cannot find any shared history with B (and B keeps it as-is until the
next gc because it is over the unpackLimit).

-Peff