Peff, first of all thanks for the feedback.

On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote:
> On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:
>
> > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > if a repository has bitmap index, pack-objects can nicely speedup
> > "Counting objects" graph traversal phase. That however was done only for
> > case when resultant pack is sent to stdout, not written into a file.
> >
> > We can teach pack-objects to use bitmap index for initial object
> > counting phase when generating resultant pack file too:
>
> I'm not sure this is a good idea in general. When bitmaps are in use, we
> cannot fill out the details in the object-packing list as thoroughly. In
> particular:
>
>   - we will not compute the same write order (which is based on
>     traversal order), leading to packs that have less efficient cache
>     characteristics

I agree the order may not be exactly the same. Still, if the original pack
is packed well (with good recency order), when using the bitmap we will
tend to traverse it in close to the original order.

Maybe I'm not completely right here, but it looks that way to me: if the
objects in the original pack are laid out linearly, sorted by recency
order, and we use the bitmap index to get the set of all objects reachable
from a root, and then just _linearly_ gather those objects from the
original pack by the 1s in the bitmap and put them in the same order into
the destination pack, the recency order won't be broken.
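To illustrate what I mean, here is a toy, compilable sketch - it is not
the real pack-bitmap.c code, and it simply assumes (as far as I understand
the format) that bit i of the reachability bitmap corresponds to the object
stored at position i of the existing pack:

---- 8< ----
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model: objects as they sit in the source pack (recency order),
 * and a reachability result over the same positions.  Gathering the
 * wanted objects with a single linear scan over the set bits emits
 * them in the order they already have in the source pack, so a
 * well-ordered pack stays well ordered.
 */
int main(void)
{
	const char *pack_order[] = { "C3", "C2", "C1", "T3", "T1", "B2", "B1" };
	const uint8_t reachable[] = {  1,    0,    1,    1,    1,    0,    1 };
	size_t i, n = sizeof(reachable) / sizeof(reachable[0]);

	for (i = 0; i < n; i++)
		if (reachable[i])
			printf("%s ", pack_order[i]);	/* relative order is kept */
	printf("\n");
	return 0;
}
---- 8< ----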
Or am I maybe misunderstanding something? Please also see below:

>   - we don't learn about the filename of trees and blobs, which is going
>     to make the delta step much less efficient. This might be mitigated
>     by turning on the bitmap name-hash cache; I don't recall how much
>     detail pack-objects needs on the name (i.e., the full name versus
>     just the hash).

If I understand it right, it uses only the uint32_t name hash while
searching. From pack-objects.{h,c}:

---- 8< ----
struct object_entry {
	...
	uint32_t hash;			/* name hint hash */
	...
};

/*
 * We search for deltas in a list sorted by type, by filename hash, and then
 * by size, so that we see progressively smaller and smaller files.
 * That's because we prefer deltas to be from the bigger file
 * to the smaller -- deletes are potentially cheaper, but perhaps
 * more importantly, the bigger file is likely the more recent
 * one. The deepest deltas are therefore the oldest objects which are
 * less susceptible to be accessed often.
 */
static int type_size_sort(const void *_a, const void *_b)
{
	const struct object_entry *a = *(struct object_entry **)_a;
	const struct object_entry *b = *(struct object_entry **)_b;

	if (a->type > b->type)
		return -1;
	if (a->type < b->type)
		return 1;
	if (a->hash > b->hash)
		return -1;
	if (a->hash < b->hash)
		return 1;
	...
---- 8< ----

Documentation/technical/pack-heuristics.txt also confirms this:

---- 8< ----
    ...
    <gitster> The quote from the above linus should be rewritten a
              bit (wait for it):

        - first sort by type. Different objects never delta with
          each other.
        - then sort by filename/dirname. hash of the basename
          occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
          DIR_BITS are for the hash of leading path elements.
    ...
    If I might add, the trick is to make files that _might_ be similar be
    located close to each other in the hash buckets based on their file
    names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
    "Makefile" all landed in the same bucket due to their common basename,
    "Makefile". However, now they land in "close" buckets.

    The algorithm allows not just for the _same_ bucket, but for _close_
    buckets to be considered delta candidates. The rationale is essentially
    that files, like Makefiles, often have very similar content no matter
    what directory they live in.
---- 8< ----

So yes, exactly as you say, with pack.writeBitmapHashCache=true (ae4f07fb)
the delta-search heuristics are almost as efficient as with just the raw
file names.
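For reference, and as far as I remember the code, the name hash is computed
roughly like below - a compilable sketch, not a verbatim copy of git's
name_hash(): the trailing (basename) characters dominate the most
significant bits, which is what makes the various "Makefile"s from the
quote above land next to each other when sorted by hash.

---- 8< ----
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the name-hint hash as I remember it: the last (basename)
 * characters end up in the most significant bits, so paths with the
 * same basename produce nearby 32-bit values and therefore sort next
 * to each other in the delta search window.
 */
static uint32_t name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

int main(void)
{
	/* the three example paths from pack-heuristics.txt above */
	printf("%08x %08x %08x\n",
	       (unsigned)name_hash("Makefile"),
	       (unsigned)name_hash("foo/Makefile"),
	       (unsigned)name_hash("bar/baz/quux/Makefile"));
	return 0;
}
---- 8< ----

In other words, a single cached 32-bit value per object is all that
type_size_sort() above consumes; the full path is not needed.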
I can confirm this, e.g. with the following (with my patch applied):

---- 8< ----
$ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs erp5pack-plain
Counting objects: 627171, done.
Compressing objects: 100% (176949/176949), done.
50570987560d481742af4a8083028c2322a0534a
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439404), reused 594820 (delta 410210)

real    0m37.272s
user    0m33.648s
sys     0m1.580s

$ time echo 0186ac99 | git pack-objects --revs erp5pack-bitmap
Counting objects: 627171, done.
Compressing objects: 100% (176914/176914), done.
7c15a9b1eca1326e679297b217c5a48954625ca2
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439484), reused 594855 (delta 410245)

real    0m27.020s
user    0m23.364s
sys     0m0.992s

$ ll erp5pack-{plain,bitmap}*
 17561860 erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.idx
238760161 erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack
 17561860 erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.idx
238634201 erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack
---- 8< ----

( By the way, about the pack generated with the bitmap retaining close to
  the original recency order:

---- 8< ----
$ git verify-pack -v erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack >1
$ git verify-pack -v erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack >2
$ grep commit 1 |awk '{print $1}' >1.commit
$ grep commit 2 |awk '{print $1}' >2.commit
$ wc -l 1.commit
46136 1.commit
$ wc -l 2.commit
46136 2.commit
$ diff -u0 1.commit 2.commit |wc -l
55
---- 8< ----

  so 55/46136 shows the commit order is almost the same. )

> There may be other subtle things, too. The general idea of tying the
> bitmap use to pack_to_stdout is that you _do_ want to use it for
> serving fetches and pushes, but for a full on-disk repack via gc, it's
> more important to generate a good pack.

It is better if we send good packs to clients too, right? And with
pack.writeBitmapHashCache=true and the recency order retained (please see
above, though again maybe I'm not completely right), it seems to me we
should still be generating a good pack while using the bitmap reachability
index for the object graph traversal.

> Your use case:
>
> > git-backup extracts many packs on repositories restoration. That was my
> > initial motivation for the patch.
>
> Seems to be somewhere in between. I'm not sure I understand how you're
> invoking pack-objects here,

It is just

    pack-objects --revs --reuse-object --reuse-delta --delta-base-offset extractedrepo/objects/pack/pack < SHA1-HEADS

https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L829

> but I wonder if you should be using "pack-objects --stdout" yourself.

I already tried --stdout. The problem is that on repository extraction we
need to both extract the pack and index it. While `pack-objects file` does
both, in the --stdout case we additionally need to index the extracted pack
with `git index-pack`, and standalone `git index-pack` is very slow - in my
experience much slower than generating the pack itself:

---- 8< ----
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
Counting objects: 627171, done.
Compressing objects: 100% (176914/176914), done.
Total 627171 (delta 439484), reused 594855 (delta 410245)

real    0m22.309s
user    0m21.148s
sys     0m0.932s

$ ll erp5pack-stdout*
238760161 erp5pack-stdout.pack

$ time git index-pack erp5pack-stdout.pack
7c15a9b1eca1326e679297b217c5a48954625ca2

real    0m50.873s    <-- more than 2 times slower than generating the pack itself!
user    0m49.300s
sys     0m1.360s

$ ll erp5pack-stdout*
 17561860 erp5pack-stdout.idx
238760161 erp5pack-stdout.pack
---- 8< ----

So the time for `pack-objects --stdout >file.pack` + `index-pack file.pack`
is ~73s (22s + 51s), while `pack-objects file.pack`, which produces both
the pack and its index, is 27s. And even
`pack-objects --no-use-bitmap-index file.pack` is 37s.

I've briefly tried to see why index-pack is so slow, and offhand I can see
that it needs to load all objects, decompress them, etc. (maybe I'm not
quite right here - I only looked briefly), while pack-objects, while
generating the pack, already has all the needed information directly at
hand and can thus emit the index much more easily.

In the server/client scenario the index-pack load is put onto the clients,
thus offloading the server, but in my use case, where the extracted
repository ends up on the same machine, the load does not go away. That's
why for me it makes more sense to emit both the pack and its index in one
go. Still, it would be interesting to eventually see why index-pack is so
anomalously slow.

> But even if it is the right thing for your use case to be using bitmaps
> to generate an on-disk bitmap, I think we should be making sure it
> _doesn't_ trigger when doing a normal repack.

So it seems the way forward here is to teach pack-objects not to silently
drop an explicit --use-bitmap-index in cases where it can handle it
(currently, even if this option is given, pack-objects simply resets
use_bitmap_index to 0 for !stdout cases), and to make sure the default for
use_bitmap_index stays 0 for !stdout cases? (A rough sketch of what I mean
is in the P.S. below.) Or are we fine with my arguments about the recency
order staying the same when using the bitmap reachability index for the
object graph traversal, so that the patch is fine to go in as it is?

Thanks again,
Kirill
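P.S. To make the first option above a bit more concrete: what I have in
mind is a tri-state default, sketched here as a tiny standalone demo rather
than a real patch against builtin/pack-objects.c (the real change would of
course go through the existing option parsing there; the names below are
only an illustration):

---- 8< ----
#include <stdio.h>
#include <string.h>

/*
 * Demo of the tri-state idea: -1 = option not given on the command
 * line, 0 = --no-use-bitmap-index, 1 = --use-bitmap-index.  The
 * default is derived from pack_to_stdout only when the user said
 * nothing, so an explicit request is never silently dropped.
 */
int main(int argc, char **argv)
{
	int use_bitmap_index = -1;	/* not set on the command line */
	int pack_to_stdout = 0;
	int i;

	for (i = 1; i < argc; i++) {
		if (!strcmp(argv[i], "--use-bitmap-index"))
			use_bitmap_index = 1;
		else if (!strcmp(argv[i], "--no-use-bitmap-index"))
			use_bitmap_index = 0;
		else if (!strcmp(argv[i], "--stdout"))
			pack_to_stdout = 1;
	}

	if (use_bitmap_index < 0)
		use_bitmap_index = pack_to_stdout;	/* default: bitmaps only for --stdout */

	printf("pack_to_stdout=%d use_bitmap_index=%d\n",
	       pack_to_stdout, use_bitmap_index);
	return 0;
}
---- 8< ----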