On Wed, 2009-08-19 at 17:13 -0400, Nicolas Pitre wrote:
> > It's the "cheaply deepen history" that I doubt would be easy. This is
> > the most difficult part, I think (see also below).
>
> Don't think so. Try this:
>
> mkdir test
> cd test
> git init
> git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git
>
> Result:
>
> remote: Counting objects: 1824, done.
> remote: Compressing objects: 100% (1575/1575), done.
> Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
> remote: Total 1824 (delta 299), reused 1165 (delta 180)
> Resolving deltas: 100% (299/299), done.
> From git://git.kernel.org/pub/scm/git/git
>  * branch            HEAD       -> FETCH_HEAD
>
> You'll get the very latest revision for HEAD, and only that. The size
> of the transfer will be roughly the size of a daily snapshot, except it
> is fully up to date. It is, however, non-resumable in the event of a
> network outage. My proposal is to replace this with a "git archive"
> call. It won't get all branches, but for the purpose of initialising
> one's repository that should be good enough. And the "git archive" can
> be fully resumable as I explained.
>
> Now to deepen that history. Let's say you want 10 more revisions going
> back; then you simply perform the fetch again with --depth=10. Right
> now it doesn't seem to work optimally, but the pack that is then being
> sent could be made of deltas against objects found in the commits we
> already have. Currently it seems that a pack is created which also
> includes the objects we already have in addition to those we want,
> which is IMHO a flaw in the shallow support that shouldn't be too hard
> to fix. Each level of deepening should then be as small as standard
> fetches going forward when updating the repository with new revisions.

Nicolas, apart from starting with the most recent commits and working
backwards, this is very similar to the "bundle slicing" idea defined in
GitTorrent. What the GitTorrent research project has achieved so far is
defining a slicing algorithm and measuring how well slicing works in
terms of wasted bandwidth. If you do it right, you can also spread the
download across mirrors.

E.g., given a starting point, a "slice size" - which I based on
uncompressed object size but which could just as well be based on
commit count - and a slice number to fetch, you should be able to look
up in the revision list index which revisions to select, and then make
a thin pack corresponding to those commits. Creating this index is
currently the slowest part of producing bundle fragments in my Perl
implementation. Once Nick Edelen's project is mergeable, we will have a
mechanism for drawing up a manifest of objects for these slices
relatively quickly.
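To make the mechanics concrete, here is a rough sketch using stock git
plumbing (this is not the code from my Perl implementation; SLICE_START
and SLICE_END stand for whatever boundary commits the index selects for
a given slice):

# Emit one slice as a thin pack: include the objects reachable from
# SLICE_END but not from SLICE_START, while allowing deltas against
# objects the receiver should already have from earlier slices.
git pack-objects --revs --thin --stdout >slice.pack <<EOF
^SLICE_START
SLICE_END
EOF

# The receiver completes the thin pack against its local object store
# before indexing it:
git index-pack --fix-thin --stdin <slice.pack

The inefficiency figures below come from comparing the total size of
such per-slice packs against the size of a single bundle of the whole
repository.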
So how much bandwidth is lost? E.g., for git.git, taking the complete
object list, slicing it into 1024k (uncompressed) bundle slices, and
making thin packs from those slices, we get:

Generating index...
Length is 1291327524, 1232 blocks
Slice #0: 1050390 => 120406 (11%)
Slice #1: 1058162 => 124978 (11%)
Slice #2: 1049858 => 104363 (9%)
...
Slice #51: 1105090 => 43140 (3%)
Slice #52: 1091282 => 45367 (4%)
Slice #53: 1067675 => 39792 (3%)
...
Slice #211: 1086238 => 25451 (2%)
Slice #212: 1055705 => 31294 (2%)
Slice #213: 1059460 => 7767 (0%)
...
Slice #1129: 1109209 => 38182 (3%)
Slice #1130: 1125925 => 29829 (2%)
Slice #1131: 1120203 => 14446 (1%)
Final slice: 623055 => 49345
Overall compressed: 39585851
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48321 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 29%

In the above output, the first figure is the complete un-delta'd,
uncompressed size of the slice - that is, the size of all of the new
objects that the commits in that slice introduce. The second figure is
the full size of a thin pack containing those objects. I.e. the above
tells me that git.git contains about 1.2GB of uncompressed objects.
Each slice pack ends up varying in size between about 10k and 200k, but
most of them end up between 15k and 50k.

Actually, the test script was thrown off by a loose root, which added
about 3MB to the compressed size, so the overall inefficiency with this
block size is really more like 20%.

I think I am running into the flaw you mention above, too, especially
when I do a run with a larger block size:

Generating index...
Length is 1291327524, 62 blocks
Slice #0: 21000218 => 1316165 (6%)
Slice #1: 20988208 => 1107636 (5%)
...
Slice #59: 21102776 => 1387722 (6%)
Slice #60: 20974960 => 876648 (4%)
Final slice: 6715954 => 261218
Overall compressed: 50071857
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48353 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 63%

Somehow, with fewer, larger slices the total packed size went up rather
than down.

Trying with 100MB "blocks" I get:

Generating index...
Length is 1291327524, 13 blocks
Slice #0: 104952661 => 4846553 (4%)
Slice #1: 104898188 => 2830056 (2%)
Slice #2: 105007998 => 2856535 (2%)
Slice #3: 104909972 => 2583402 (2%)
Slice #4: 104909440 => 2187708 (2%)
Slice #5: 104859786 => 2555686 (2%)
Slice #6: 104873317 => 2358914 (2%)
Slice #7: 104881597 => 2183894 (2%)
Slice #8: 104863418 => 3555224 (3%)
Slice #9: 104896599 => 3192564 (3%)
Slice #10: 104876697 => 3895707 (3%)
Slice #11: 104903491 => 3731555 (3%)
Final slice: 32494360 => 1270887
Overall compressed: 38048685
Calculating Repository bundle size...
Counting objects: 107369, done.
Compressing objects: 100% (28059/28059), done.
Writing objects: 100% (107369/107369), 29.20 MiB | 48040 KiB/s, done.
Total 107369 (delta 78185), reused 106770 (delta 77609)
Bundle size: 30638967
Overall inefficiency: 24%

In the above, we broke the git.git download into 13 partial downloads
of a few meg each, at the cost of an extra 24% of download.

Anyway, I hope to have more to add to this, but it will do as a
starting point.

Sam

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html