Re: Resumable clone/Gittorrent (again)

Maaartin-1 <grajcar1@xxxxxxxxx> · Sat, 08 Jan 2011 02:04:40 +0100

On 11-01-06 07:36, Nguyen Thai Ngoc Duy wrote:
> On Thu, Jan 6, 2011 at 10:34 AM, Maaartin-1 <grajcar1@xxxxxxxxx> wrote:
>> In theory, I could create many commits per second. I could create many
>> unique paths per second, too. But I don't think it really happens. I
>> know of no larger repository than git.git and I don't want to download it
>> just to see how many commits, paths, and objects it contains, but I'd
>> suppose it's less than one million commits, which should be manageable,
>> especially when commits get grouped together as I described below.
> 
> In practice, commits are created every day in an active project. Paths
> on the other hand are added less often (perhaps except webkit).
> 
> I've got some numbers:
> 
>  - wine.git has 72k commits, 260k trees, 200k blobs, 12k paths
>  - git.git has 24k commits, 39k trees, 24k blobs, 2.7k paths
>  - linux-2.6.git has 160k commits, 760k trees, 442k blobs, 46k paths
> 
> Large repos are more interesting because small ones can be cloned with
> git-clone.

Sure. Linux is the winner and has about 3.5 times as many commits as paths.

> Listing all those commits in linux-2.6.git takes 160k*20=3M (I suppose
> compressing is useless because SHA-1 is random). A compressed listing
> of those 46k paths takes 200k.

Sure, Linux has only about 3.5 times as many commits as paths, but the
commit listing needs roughly 16 times more storage (160k*20 bytes = 3.2 MB
versus 200 kB). What does it tell us?

IMHO it speaks in favor of my proposal. Imagine a path changing with
nearly every commit. The root directory is such a path, and directories
near the top come close (as may some files, such as todo lists). For
each such path you need about 3 MB just to store the commit SHAs. Of
course, you can invent a scheme that makes storing all the SHAs
unnecessary, but this is another complication.
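
To put rough numbers on this, here is a back-of-the-envelope sketch. The
counts are the approximate linux-2.6.git figures quoted above, the 200 kB
path listing is taken as given, and the variable names are made up for
illustration:

```python
# Approximate figures for linux-2.6.git quoted earlier in this thread.
COMMITS = 160_000
PATHS = 46_000
SHA1_BYTES = 20                 # size of a raw (binary) SHA-1
PATH_LISTING_BYTES = 200_000    # compressed path listing, as quoted

# Listing every commit SHA once, uncompressed.
commit_listing_bytes = COMMITS * SHA1_BYTES

# How many commits there are per path, and how the two listings compare.
count_ratio = COMMITS / PATHS
storage_ratio = commit_listing_bytes / PATH_LISTING_BYTES

print(f"commit listing: {commit_listing_bytes / 1e6:.1f} MB")
print(f"commits per path: {count_ratio:.1f}")
print(f"listing size ratio: {storage_ratio:.0f}x")
```

A path that changes with nearly every commit would need a listing of about
that size for itself alone, which is the cost being argued about here.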

OTOH, with the commits used as directory entries we get quite a large
directory. Is this the problem you wanted to make me aware of?

> The point is you need to fetch its parent commits first in order to
> verify a commit. Fetching a whole commit is more expensive than a
> file. So while you can fetch a few commit bases and request packs
> from those bases in parallel, the cost of initial commit bases will be
> high.

You've lost me. I assume you mean something like this: there may be
very large commits (e.g., in a project not versioned from the very
beginning). I'd suggest splitting such commits into two parts by
classifying the blobs (and trees) using a fixed bit of their SHAs. Of
course, this can be repeated in order to get even smaller parts. Let's
assume a commit X gets split into X0 and X1. As before, for compressing
X0 you may use the content of any predecessor of X. For compressing
X1 you may additionally use the content of X0. This way the compression
rate could stay close to optimal, IMHO.
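
A minimal sketch of the split, assuming the objects of a commit are given
as hex SHA-1 strings. The function names and the choice of bit 0 (the most
significant bit) are illustrative assumptions, and the SHAs are made-up
test data, not real git object IDs:

```python
import hashlib


def sha_bit(sha_hex, bit):
    """Return bit number `bit` (0 = most significant) of a hex SHA."""
    value = int(sha_hex, 16)
    total_bits = len(sha_hex) * 4
    return (value >> (total_bits - 1 - bit)) & 1


def split_commit(object_shas, bit=0):
    """Partition the objects of a commit X into (X0, X1) by one fixed SHA bit."""
    parts = ([], [])
    for sha in object_shas:
        parts[sha_bit(sha, bit)].append(sha)
    return parts


# Made-up object IDs: hash some dummy strings to get SHA-like values.
shas = [hashlib.sha1(f"blob {i}".encode()).hexdigest() for i in range(1000)]

x0, x1 = split_commit(shas, bit=0)
# Repeating the split with the next bit gives even smaller parts.
x00, x01 = split_commit(x0, bit=1)
```

Because SHA-1 output is effectively uniform, each split should cut the
object count roughly in half, and both sides can compute the same partition
without exchanging any extra data.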

> They are interchangeable as a whole, yes. But you cannot fetch half
> the pack from server A and the other half from server B. You can try
> to recover as many deltas as possible in a broken pack, but how do you
> request a server to send the rest of the pack to you?

Indeed, it's not resumable. For most commits that's not needed, since they
are very small. Why? There are more commits than paths, so commits
are smaller than paths on average. I expect my scheme to allow for
nearly as good compression as git usually achieves; in particular, I'd
hope it's no worse than when packing paths.

However, there may be very large commits in my scheme (and maybe also
very large "path-packs" in yours). Such large commits get split as I
described above. Small commits get paired (possibly multiple times) as I
described earlier. You end up with only reasonably sized pieces of data,
let's say between 256 and 512 kB, so you don't need to resume.
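
A rough sketch of how such sizing might work, under loud assumptions: the
pairing rule itself is not spelled out in this message, so the greedy
grouping of neighbouring commits below is only a stand-in, and repeated
halving stands in for the SHA-bit split. All names are hypothetical:

```python
MIN_PIECE = 256 * 1024   # lower target from the text, in bytes
MAX_PIECE = 512 * 1024   # upper target from the text, in bytes


def split_size(size):
    """Halve a commit repeatedly until every part fits into MAX_PIECE."""
    parts = 1
    while size / parts > MAX_PIECE:
        parts *= 2
    base, extra = divmod(size, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def plan_pieces(commit_sizes):
    """Group commits (in the given order) into pieces of at most MAX_PIECE.

    Each piece is a list of (commit_index, chunk_size) entries. A piece is
    closed as soon as it reaches MIN_PIECE, so some pieces may stay smaller.
    """
    pieces, current, current_size = [], [], 0
    for index, size in enumerate(commit_sizes):
        for chunk in split_size(size):
            # Close the current piece early if this chunk would overflow it.
            if current and current_size + chunk > MAX_PIECE:
                pieces.append(current)
                current, current_size = [], 0
            current.append((index, chunk))
            current_size += chunk
            if current_size >= MIN_PIECE:
                pieces.append(current)
                current, current_size = [], 0
    if current:
        pieces.append(current)
    return pieces


# Made-up commit sizes in bytes: mostly small, one very large.
sizes = [2_000, 3_000_000, 100_000, 150_000, 40_000]
pieces = plan_pieces(sizes)
```

With pieces capped at 512 kB, an aborted transfer costs at most one piece,
which is why resuming inside a piece matters little.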

Actually, with a really bad connection, you could ask the very server
from which you obtained an incomplete pack to resume from a given byte
offset (similar to HTTP range requests). The server may or may not still
have the pack. This time it should try to keep it available for you in
case your connection aborts again. Don't get me wrong -- this is just an
additional help for people with very bad connections.
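
The resume idea can be simulated in a few lines. This is only an in-memory
sketch: PackServer, send_pack, resume and fetch_with_resume are hypothetical
names, not part of git or of any existing protocol:

```python
class PackServer:
    """Hypothetical server that may keep sent packs around for resuming."""

    def __init__(self):
        self._cache = {}

    def send_pack(self, pack_id, data, keep=True):
        """Send a pack, optionally remembering it for a later resume."""
        if keep:
            self._cache[pack_id] = data
        return data

    def resume(self, pack_id, offset):
        """Return the bytes from `offset` on, or None if the pack is gone."""
        data = self._cache.get(pack_id)
        if data is None or offset > len(data):
            return None
        return data[offset:]


def fetch_with_resume(server, pack_id, data, cut_at):
    """Simulate a transfer that aborts after `cut_at` bytes, then resumes."""
    received = server.send_pack(pack_id, data)[:cut_at]  # connection drops
    rest = server.resume(pack_id, len(received))
    if rest is None:
        return None  # server no longer has the pack; start over elsewhere
    return received + rest


server = PackServer()
pack = bytes(range(256)) * 4
result = fetch_with_resume(server, "pack-1", pack, cut_at=100)
```

The client only has to remember which server it talked to and how many
bytes it already holds; a None answer means falling back to a fresh fetch.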