Back in fff42755ef (pack-bitmap: add support for bitmap indexes, 2013-12-21), we added support for reachability bitmaps, and taught pack-objects how to reuse verbatim chunks from the bitmapped pack. When multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to use the MIDX's "preferred" pack as the source for verbatim reuse. This allows repositories to incrementally repack themselves (e.g., using a `--geometric` repack), storing the result in a MIDX, and generating a corresponding bitmap. This keeps our bitmap coverage up-to-date, while maintaining a relatively small number of packs. However, it is recommended (and matches what we do in production at GitHub) that repositories repack themselves all-into-one, and generate a corresponding single-pack reachability bitmap. This is done for a couple of reasons, but the most relevant one to this series is that it enables us to perform verbatim pack-reuse over a complete copy of the repository, since the entire repository resides in a single pack (and thus is eligible for verbatim pack-reuse). As repositories grow larger, packing their contents into a single pack becomes less feasible. This series extends the pack-reuse mechanism to operate over multiple packs which are known ahead of time to be disjoint with respect to one another's set of objects. The implementation has a few components: - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced to keep track of the bitmap position, number of objects, and disjointed-ness for each pack contained in the MIDX. - A new mode for `git multi-pack-index write --stdin-packs` that allows specifying disjoint packs, as well as a new option `--retain-disjoint` which preserves the set of existing disjoint packs in the new MIDX. - A new pack-objects mode `--ignore-disjoint`, which produces packs which are disjoint with respect to the current set of disjoint packs (i.e. it discards any objects from the packing list which appear in any of the known-disjoint packs). - A new repack mode, `--extend-disjoint` which causes any new pack(s) which are generated to be disjoint with respect to the set of packs currently marked as disjoint, minus any pack(s) which are about to be deleted. With all of that in place, the patch series then rewrites all of the pack-reuse functions in terms of the new `bitmapped_pack` structure. Once we have dropped all of the assumptions stemming from only performing pack-reuse over a single candidate pack, we can then enable reuse over all of the disjoint packs. In addition to the many new tests in t5332 added by that series, I tried to simulate a "real world" test on git.git by breaking the repository into chunks of 1,000 commits (plus their set of reachable objects not reachable from earlier chunk(s)) and packing those chunks. This produces a large number of packs with the objects from git.git which are known to be disjoint with respect to one another. $ git clone git@xxxxxxxxxx:git/git.git base $ cd base $ mv .git/objects/pack/pack-*.idx{,.bak} $ git unpack-objects <.git/objects/pack/pack-*.pack # pack the objects from each successive block of 1k commits $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac) do echo $rev | git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1 done # and grab any stragglers, pruning the unpacked objects $ git repack -d I then constructed a MIDX and corresponding bitmap $ find_pack () { for idx in .git/objects/pack/pack-*.idx do git show-index <$idx | grep -q "$1" && basename $idx done } $ preferred="$(find_pack $(git rev-parse HEAD))" $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' | git.compile multi-pack-index write --bitmap --stdin-packs \ --preferred-pack=$preferred $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in With all of that in place, I was able to produce a significant speed-up by reusing objects from multiple packs: $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null' Benchmark 1: single-pack reuse Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s] Range (min … max): 6.063 s … 6.126 s 10 runs Benchmark 2: multi-pack reuse Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms] Range (min … max): 903.5 ms … 912.7 ms 10 runs Summary multi-pack reuse ran 6.72 ± 0.03 times faster than single-pack reuse (There are corresponding tests in p5332 that test different sized chunks and measure the runtime performance as well as resulting pack size). Performing verbatim pack reuse naturally trades off between CPU time and the resulting pack size. In the above example, the single-pack reuse case produces a clone size of ~194 MB on my machine, while the multi-pack reuse case produces a clone size closer to ~266 MB, which is a ~37% increase in clone size. I think there is still some opportunity to close this gap, since the "packing" strategy here is extremely naive. In a production setting, I'm sure that there are more well thought out repacking strategies that would produce more similar clone sizes. I considered breaking this series up into smaller chunks, but was unsatisfied with the result. Since this series is rather large, if you have alternate suggestions on better ways to structure this, please let me know. Thanks in advance for your review! Taylor Blau (24): pack-objects: free packing_data in more places pack-bitmap-write: deep-clear the `bb_commit` slab pack-bitmap: plug leak in find_objects() midx: factor out `fill_pack_info()` midx: implement `DISP` chunk midx: implement `midx_locate_pack()` midx: implement `--retain-disjoint` mode pack-objects: implement `--ignore-disjoint` mode repack: implement `--extend-disjoint` mode pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` pack-objects: parameterize pack-reuse routines over a single pack pack-objects: keep track of `pack_start` for each reuse pack pack-objects: pass `bitmapped_pack`'s to pack-reuse functions pack-objects: prepare `write_reused_pack()` for multi-pack reuse pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse pack-objects: include number of packs reused in output pack-bitmap: prepare to mark objects from multiple packs for reuse pack-objects: add tracing for various packfile metrics t/test-lib-functions.sh: implement `test_trace2_data` helper pack-objects: allow setting `pack.allowPackReuse` to "single" pack-bitmap: reuse objects from all disjoint packs t/perf: add performance tests for multi-pack reuse Documentation/config/pack.txt | 8 +- Documentation/git-multi-pack-index.txt | 12 ++ Documentation/git-pack-objects.txt | 8 + Documentation/git-repack.txt | 12 ++ Documentation/gitformat-pack.txt | 109 ++++++++++ builtin/multi-pack-index.c | 13 +- builtin/pack-objects.c | 200 +++++++++++++++---- builtin/repack.c | 57 +++++- midx.c | 218 +++++++++++++++++--- midx.h | 11 +- pack-bitmap-write.c | 9 +- pack-bitmap.c | 265 ++++++++++++++++++++----- pack-bitmap.h | 18 +- pack-objects.c | 15 ++ pack-objects.h | 1 + t/helper/test-read-midx.c | 31 ++- t/lib-disjoint.sh | 49 +++++ t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++ t/t5319-multi-pack-index.sh | 140 +++++++++++++ t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++ t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++ t/t6113-rev-list-bitmap-filters.sh | 2 + t/t7700-repack.sh | 4 +- t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++ t/test-lib-functions.sh | 14 ++ 25 files changed, 1650 insertions(+), 144 deletions(-) create mode 100644 t/lib-disjoint.sh create mode 100755 t/perf/p5332-multi-pack-reuse.sh create mode 100755 t/t5332-multi-pack-reuse.sh create mode 100755 t/t7705-repack-extend-disjoint.sh base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d -- 2.43.0.24.g980b318f98