On 11-01-06 02:32, Nguyen Thai Ngoc Duy wrote: > On Thu, Jan 6, 2011 at 6:28 AM, Maaartin <grajcar1@xxxxxxxxx> wrote: >> Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes: >> I haven't read the whole other thread yet, but what about going the other way >> round? Use a single commit as a chain, create deltas assuming that all >> ancestors are already available. The packs may arrive out of order, so the >> decompression may have to wait. The number of commits may be one order of >> magnitude larger than the the number of paths (there are currently 2254 paths >> and 24235 commits in git.git), so grouping consequent commits into one larger >> pack may be useful. > > The number of commits can increase fast. I'd rather have a > small/stable number over time. In theory, I could create many commits per seconds. I could create many unique paths per seconds, too. But I don't think it really happens. I do know no larger repository than git.git and I don't want to download it just to see how many commits, paths, and object it contains, but I'd suppose it's less than one million commits, which should be manageable, especially when commits get grouped together as I described below. > And commits depend on other commits so > you can't verify a commit until you have got all of its parents. That > does apply to file, but then this file chain does not interfere other > file chains. That's true, but the verification is something done locally on the client, it consumes no network traffic and no server resources, so I consider it to be cheap. I need less than half a minute (using only a single core) for verifying of the whole git.git repository (36 MB). This is no problem, even when it had to wait until the download finishes. I'm sure, the OP of [1] would be happy if he could wait for this. >> The advantage is that the packs stays stable over time, you may create them >> using the most aggressive and time-consuming settings and store them forever. >> You could create packs for single commits, packs for non-overlapping >> consecutive pairs of them, for non-overlapping pairs of pairs, etc. I mean with >> commits numbered 0, 1, 2, ... create packs [0,1], [2,3], ..., [0,3], [4,7], >> etc. The reason for this is obviously to allow reading groups of commits from >> different servers so that they fit together (similar to Buddy memory >> allocation). Of course, there are things like branches bringing chaos in this >> simple scheme, but I'm sure this can be solved somehow. > > Pack encoding can change. I see I didn't explain it clear enough (or am missing something completely). I know why the packs normally used by git can't be used for this purpose. Let me retry: Let's assume there's a commit chain A-B-C-D-E-F-..., the client has already commit B and requests commit F. It may send requests to up to 4 servers, asking for C, D, E, and F, respectively. The server being asked for E _creates_ a pack containing all the information needed to create E given _all of_ A, B, C, D. As base for any blob/whatever in E it may choose any blob contained in any of these commits. Of course, it may also choose a blob already packed in this pack. It may not choose any other blob, so any client having all ancestors of E can use the pack. Different server and/or program versions may create different packs for E, but all of them are _interchangeable_. Because of this, it makes sense to _store_ it for future reuse. Compared to the way git packing normally works, this is a restriction, but I don't think it leads to significantly worse compression. You guys working on git can confirm or disprove it. > And packs can contain objects you don't want > to share (i.e. hidden from public view). This pack would contain only commit E. I also described pairing intended for greater efficiency. In this case a server creates a pack allowing e.g. to create commits E and F given all their ancestors (while other server creates a pack for C and D). This way the number of packs needed may be a fraction of the total number of commits requested. >> Another problem is the client requesting commits A and B while declaring to >> possess commits C and D. When both C and D are ancestors of either A or B, you >> can ignore it (as you assume this while packing, anyway). The other case is >> less probable, unless e.g. C is the master and A is a developing branch. >> Currently. I've no idea how to optimize this and whether this could be >> important. > > As I said, we can request just part of a chain (from A+B to C+D). > git-fetch should be used if the repo is quite uptodate though. It's > just more efficient. [1] http://article.gmane.org/gmane.comp.version-control.git/164564 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html