On Tue, Aug 07, 2018 at 04:23:04PM -0700, Jonathan Tan wrote: > > Do we already have an access to the in-core reverse index for the > > pack at this point in the code? > > As far as I can tell, no. These patches construct the list of promisor > objects in repack.c (which does not use the revindex at all), to be > processed by pack-objects in a different process (which does use the > revindex in reuse-delta mode, which is the default). I guess that we > could have access to the revindex if we were to libify pack-objects and > run it in the same process as repack.c or if we were to add a special > mode to pack-objects that reads for itself the list of all the promisor > objects. It's not the end of the world to access the revindex here. It requires effort proportional to the size of the packfile, which makes it a bad choice for operations that want to touch only a few objects. But here you're iterating over the whole packfile anyway. I think that the for_each_packed_object() interface should visit items in pack order. Something like this (I'd probably change the name of the inner "i" variable for clarity, but I didn't here to minimize the diff): diff --git a/packfile.c b/packfile.c index 04d387f312..334f47e4cd 100644 --- a/packfile.c +++ b/packfile.c @@ -1887,10 +1887,13 @@ int has_pack_index(const unsigned char *sha1) int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data) { - uint32_t i; + uint32_t in_pack; int r = 0; - for (i = 0; i < p->num_objects; i++) { + load_pack_revindex(p); + + for (in_pack = 0; in_pack < p->num_objects; in_pack++) { + uint32_t i = p->revindex[in_pack].nr; struct object_id oid; if (!nth_packed_object_oid(&oid, p, i)) Possibly it should be an optional flag, in case the caller is going to eventually sort by sha1 anyway. But the major caller that sorts by sha1 is cat-file's --batch-all-objects option. And IMHO that is actually a bad default. Try timing this in a fully-packed repository: time git cat-file --batch-all-objects --buffer --batch | wc -c versus this: time git show-index <.git/objects/pack/*.idx | sort -n | awk '{print $2}' | git cat-file --buffer --batch | wc -c On my git.git repo, it's literally 7x faster to do the latter, because of locality in resolving deltas. So I think we should consider at least making it an option for cat-file to do the iteration in pack-ish order (it can't be perfect, of course, because there may be multiple packs, loose objects, etc, but approximating it is a good idea). I guess it has to be an option, since we advertise in the manpage that the output is sorted by hash. > > If so, we can enumerate the objects > > in the offset order without incurring a lot of cost (building the > > in-core revindex is the more expensive part). When writing a pack, > > we try to make sure that related objects are written out close to > > each other [*1*], so listing them in the offset order (with made-up > > pathname information to _force_ that objects that live close > > together in the original pack are grouped together by getting > > similar names) might give us a better than horrible deltification. > > I dunno. > > Before I write the NEEDSWORK comment, just to clarify - do you mean > better than horrible object locality? I think deltification still occurs > because pack-objects sorts the input when doing the windowed > deltification, for example: > > git rev-parse HEAD:fetch-pack.c HEAD HEAD^ HEAD^^ \ > HEAD^^^ v2.17.0:fetch-pack.c | git pack-objects --window=2 abc > > produces a packfile with a delta, as checked by `verify-pack -v`. You're going to get a lot of reused deltas from the old pack, but you'll probably miss opportunities to delta objects between packs. And pack-object's compute_write_order() will fix some of the issues: - it will group objects of each type together - it will group delta "families" together So the resulting packfile may not actually be _too_ bad with respect to delta locality. Where it will fail is: - there's no locality with respect to history, so a traversal will have bad I/O patterns as it skips all over the packfile - the delta heuristics do use the incoming order (which usually is from a traversal) as a hint. But IMHO that is way less important than lacking the names. Even if you have an actual traversal, two versions of the same file may be separated by hundreds of other blob changes in the middle. Traversal order doesn't help you; putting things with similar names next to each other does. For traversals using bitmaps, we have a "name hash cache" in the .bitmap file, which stores the 32-bit hash of the name. That lets us put similar names next to each other without actually knowing the names. If we absolutely can't traverse as part of the strategy for repacking a promisor pack, then it may be worth trying to figure out how to carry that hash around. -Peff