On Sat, Jan 8, 2011 at 8:04 AM, Maaartin-1 <grajcar1@xxxxxxxxx> wrote:
>> Listing all those commits in linux-2.6.git takes 160k*20=3M (I suppose
>> compressing is useless because SHA-1 is random). A compressed listing
>> of those 46k paths takes 200k.
>
> Sure, Linux has only 4 times as many commits as paths, but the commits
> need 30 times more storage. What does that tell us?
>
> IMHO it speaks in favor of my proposal. Imagine a path changing with
> nearly every commit. The root directory is such a path, and near-top
> directories come close to it (as may other files, like todo lists). For
> each such file you need 3 MB just for storing the commit SHAs. Of
> course, you can invent a scheme that makes storing all the SHAs
> unnecessary, but that is another complication.
>
> OTOH, with the commits used as directory entries we get quite a large
> directory. Is this the problem you wanted to make me aware of?

I am merely pointing out that if we use commit SHA-1s as "pieces", then
when a new peer comes in and asks a running peer "what pieces have you
got (so that I can start fetching from you)?", you will need more
bandwidth for that kind of information.

>> The point is you need to fetch its parent commits first in order to
>> verify a commit. Fetching a whole commit is more expensive than a
>> file. So while you can fetch a few commit bases and request packs
>> from those bases in parallel, the cost of the initial commit bases
>> will be high.
>
> You've lost me. I assume you mean something like: there may be very
> large commits (e.g., in a project not versioned from the very
> beginning). I'd suggest splitting such commits in two parts by
> classifying the blobs (and trees) using a fixed bit of their SHAs. Of
> course, this can be repeated in order to get even smaller parts. Let's
> assume a commit X gets split into X0 and X1. As before, for compressing
> X0 you may use the content of any predecessor of X. For compressing
> X1 you may additionally use the content of X0.
> This way the compression rate could stay close to optimal, IMHO.

Well, if you are going to split a commit, then splitting by paths sounds
more natural to me (assuming that people don't often move files).

>> They are interchangeable as a whole, yes. But you cannot fetch half
>> the pack from server A and the other half from server B. You can try
>> to recover as many deltas as possible in a broken pack, but how do you
>> request a server to send the rest of the pack to you?
>
> Indeed, it's not resumable. For most commits it's not needed since they
> are very small. Why? There are more commits than paths, so the commits
> are smaller than paths on average. I expect my scheme to allow for
> nearly as good compression as git usually achieves; in particular, I'd
> hope it's no worse than when packing by paths.

A commit diff consists of all the tree and blob diffs relative to the
parent commit (let's ignore merges). How can it be smaller than just a
single tree/blob diff (of the same path, compared to the parent commit)?

> However, there may be very large commits in my scheme (and maybe also
> very large "path-packs" in yours). Such large commits get split as I
> described above. Small commits get paired (possibly multiple times) as I
> described earlier. You end up with only reasonably sized pieces of data,
> say between 256 and 512 kB, so you don't need to resume.

Yeah, I started thinking about how to transfer efficiently and came to a
similar idea: assuming we have a good packing order and know what to
pack, we close the pack when its size exceeds a limit and start sending
another pack. If this pack stream gets corrupted, we resume from the
corrupt pack forward. I currently hardcode the limit at 4k; no larger,
because the per-pack overhead is already very low (a 12-byte header and
a 20-byte SHA-1 trailer per pack).
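The pack-splitting idea above can be sketched roughly like this. This is a minimal Python illustration, not git's actual pack code; the function names, the `(name, size)` object representation, and the corruption-reporting convention are all my own assumptions. Only the 12-byte header, 20-byte trailer, and 4k limit come from the description above.

```python
# Illustrative sketch: stream objects in a fixed packing order, closing
# the current pack once it exceeds a size limit, so that a broken
# transfer can resume from the first corrupt pack forward.

PACK_HEADER_BYTES = 12   # pack header ("PACK", version, object count)
PACK_TRAILER_BYTES = 20  # SHA-1 checksum trailing each pack
PACK_LIMIT = 4 * 1024    # the hardcoded 4k limit mentioned above

def split_into_packs(objects, limit=PACK_LIMIT):
    """Group (name, size) pairs into packs, closing a pack as soon as
    its total size (including header/trailer overhead) exceeds `limit`."""
    overhead = PACK_HEADER_BYTES + PACK_TRAILER_BYTES
    packs, current, current_size = [], [], overhead
    for name, size in objects:
        current.append(name)
        current_size += size
        if current_size > limit:          # close this pack, start a new one
            packs.append(current)
            current, current_size = [], overhead
    if current:                           # flush the last, possibly small, pack
        packs.append(current)
    return packs

def resume_from(packs, first_corrupt):
    """On a corrupted stream the receiver names the first bad pack and
    the sender simply re-sends from that pack forward."""
    return packs[first_corrupt:]
```

With sizes 3000+2000 the first pack overflows the 4k limit and is closed after two objects, while the remaining small objects share the next pack; resuming at pack index 1 re-sends only the later packs, which is the whole point of keeping the per-pack overhead small.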
> Actually, with a really bad connection, you could ask the very server
> from which you obtained an incomplete pack to resume from a given byte
> offset (similar to HTTP ranges). The server may or may not have it. This
> time it should try to keep it available for you in case your connections
> abort again. Don't get me wrong -- this is just an additional help for
> very badly connected people.

Replace "byte range" with rev-list syntax (SHA1~10..SHA1~20) and we have
quite a fine-grained way of asking for data. Deltas are usually very
small (I observed cache.h only, so this could be a wrong assumption).
But if SHA1~10..SHA1~11 has a diff so big that it keeps failing, sending
byte ranges of the blob in SHA1~11 is probably the only way left.
--
Duy