Re: Resumable clone/Gittorrent (again)

On Thu, Jan 6, 2011 at 10:34 AM, Maaartin-1 <grajcar1@xxxxxxxxx> wrote:
> In theory, I could create many commits per second. I could create many
> unique paths per second, too. But I don't think that really happens. I
> know of no larger repository than git.git, and I don't want to download
> one just to see how many commits, paths, and objects it contains, but
> I'd guess it's fewer than one million commits, which should be
> manageable, especially when commits get grouped together as I described
> below.

In practice, commits are created every day in an active project. Paths,
on the other hand, are added less often (except perhaps in webkit).

I've got some numbers:

 - wine.git has 72k commits, 260k trees, 200k blobs, 12k paths
 - git.git has 24k commits, 39k trees, 24k blobs, 2.7k paths
 - linux-2.6.git has 160k commits, 760k trees, 442k blobs, 46k paths

Large repos are more interesting because small ones can be cloned with
git-clone.

Listing all those commits in linux-2.6.git takes 160k * 20 bytes = ~3 MB
of raw SHA-1s (I suppose compression is useless because SHA-1 output is
effectively random). A compressed listing of those 46k paths takes about
200 kB.

>> And commits depend on other commits, so
>> you can't verify a commit until you have all of its parents. That
>> does apply to files too, but one file's chain does not interfere
>> with other files' chains.
>
> That's true, but the verification is done locally on the client; it
> consumes no network traffic and no server resources, so I consider it
> cheap. I need less than half a minute (using only a single core) to
> verify the whole git.git repository (36 MB). This is no problem, even
> if it had to wait until the download finishes. I'm sure the OP of [1]
> would be happy if he could wait for this.

The point is that you need to fetch a commit's parents first in order to
verify it. Fetching a whole commit is more expensive than fetching a
single file. So while you can fetch a few base commits and request packs
built on those bases in parallel, the cost of fetching the initial base
commits will be high.
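For reference, the per-object verification itself is cheap because git's
object names are content hashes: an object's SHA-1 is computed over a
small type/size header followed by the payload. A minimal sketch (the
parent-dependency cost described above is about *obtaining* the payload,
not hashing it):

```python
import hashlib

def git_object_name(obj_type: str, payload: bytes) -> str:
    """Name of a git object: SHA-1 over '<type> <size>\\0' + payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()

# git's well-known empty-blob name:
print(git_object_name("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```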

> I see I didn't explain it clearly enough (or am missing something
> completely). I know why the packs normally used by git can't be used
> for this purpose. Let me retry: let's assume there's a commit chain
> A-B-C-D-E-F-..., the client already has commit B and requests commit F.
> It may send requests to up to 4 servers, asking for C, D, E, and F,
> respectively. The server asked for E _creates_ a pack containing all
> the information needed to create E given _all of_ A, B, C, and D. As
> the base for any blob/whatever in E it may choose any blob contained in
> any of these commits. Of course, it may also choose a blob already
> packed in this pack. It may not choose any other blob, so any client
> having all ancestors of E can use the pack. Different server and/or
> program versions may create different packs for E, but all of them are
> _interchangeable_. Because of this, it makes sense to _store_ the pack
> for future reuse.

They are interchangeable as a whole, yes. But you cannot fetch half the
pack from server A and the other half from server B. You can try to
recover as many deltas as possible from a broken pack, but how do you
ask a server to send you the rest of the pack?
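To make the constraint in the quoted scheme concrete: every delta base
in the pack for E must be either an object appearing earlier in the same
pack or an object reachable from the ancestor commits A..D. A
hypothetical checker (the function name and the (sha, base) pack
representation are invented for illustration, not git's actual pack
format):

```python
def pack_usable_by_client(pack_entries, ancestor_objects):
    """pack_entries: list of (sha, base) pairs in pack order, where base
    is None for a non-delta object. Returns True if every delta base is
    either already in the pack or among the ancestor objects, i.e. the
    pack can be applied by any client having all ancestors of E."""
    in_pack = set()
    for sha, base in pack_entries:
        if base is not None and base not in in_pack \
                and base not in ancestor_objects:
            return False  # delta base the client may not have
        in_pack.add(sha)
    return True

# Deltifying against an ancestor blob ("a1") or an in-pack object ("e1")
# is fine; referencing an unknown base ("x9") is not.
print(pack_usable_by_client([("e1", "a1"), ("e2", "e1")], {"a1", "a2"}))  # True
print(pack_usable_by_client([("e1", "x9")], {"a1", "a2"}))                # False
```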
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

