On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote: > Back in fff42755ef (pack-bitmap: add support for bitmap indexes, > 2013-12-21), we added support for reachability bitmaps, and taught > pack-objects how to reuse verbatim chunks from the bitmapped pack. When > multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to > use the MIDX's "preferred" pack as the source for verbatim reuse. > > This allows repositories to incrementally repack themselves (e.g., using > a `--geometric` repack), storing the result in a MIDX, and generating a > corresponding bitmap. This keeps our bitmap coverage up-to-date, while > maintaining a relatively small number of packs. > > However, it is recommended (and matches what we do in production at > GitHub) that repositories repack themselves all-into-one, and > generate a corresponding single-pack reachability bitmap. This is done > for a couple of reasons, but the most relevant one to this series is > that it enables us to perform verbatim pack-reuse over a complete copy > of the repository, since the entire repository resides in a single pack > (and thus is eligible for verbatim pack-reuse). > > As repositories grow larger, packing their contents into a single pack > becomes less feasible. This series extends the pack-reuse mechanism to > operate over multiple packs which are known ahead of time to be disjoint > with respect to one another's set of objects. > > The implementation has a few components: > > - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced > to keep track of the bitmap position, number of objects, and > disjointed-ness for each pack contained in the MIDX. > > - A new mode for `git multi-pack-index write --stdin-packs` that > allows specifying disjoint packs, as well as a new option > `--retain-disjoint` which preserves the set of existing disjoint > packs in the new MIDX. > > - A new pack-objects mode `--ignore-disjoint`, which produces packs > which are disjoint with respect to the current set of disjoint packs > (i.e. it discards any objects from the packing list which appear in > any of the known-disjoint packs). > > - A new repack mode, `--extend-disjoint` which causes any new pack(s) > which are generated to be disjoint with respect to the set of packs > currently marked as disjoint, minus any pack(s) which are about to > be deleted. > > With all of that in place, the patch series then rewrites all of the > pack-reuse functions in terms of the new `bitmapped_pack` structure. > Once we have dropped all of the assumptions stemming from only > performing pack-reuse over a single candidate pack, we can then enable > reuse over all of the disjoint packs. > > In addition to the many new tests in t5332 added by that series, I tried > to simulate a "real world" test on git.git by breaking the repository > into chunks of 1,000 commits (plus their set of reachable objects not > reachable from earlier chunk(s)) and packing those chunks. This produces > a large number of packs with the objects from git.git which are known to > be disjoint with respect to one another. > > $ git clone git@xxxxxxxxxx:git/git.git base > > $ cd base > $ mv .git/objects/pack/pack-*.idx{,.bak} > $ git unpack-objects <.git/objects/pack/pack-*.pack > > # pack the objects from each successive block of 1k commits > $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac) > do > echo $rev | > git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1 > done > # and grab any stragglers, pruning the unpacked objects > $ git repack -d > I then constructed a MIDX and corresponding bitmap > > $ find_pack () { > for idx in .git/objects/pack/pack-*.idx > do > git show-index <$idx | grep -q "$1" && basename $idx > done > } > $ preferred="$(find_pack $(git rev-parse HEAD))" > > $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' | > git.compile multi-pack-index write --bitmap --stdin-packs \ > --preferred-pack=$preferred > $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in > > With all of that in place, I was able to produce a significant speed-up > by reusing objects from multiple packs: > > $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null' > Benchmark 1: single-pack reuse > Time (mean ± σ): 6.094 s ± 0.023 s [User: 43.723 s, System: 0.358 s] > Range (min … max): 6.063 s … 6.126 s 10 runs > > Benchmark 2: multi-pack reuse > Time (mean ± σ): 906.5 ms ± 3.2 ms [User: 1081.5 ms, System: 30.9 ms] > Range (min … max): 903.5 ms … 912.7 ms 10 runs > > Summary > multi-pack reuse ran > 6.72 ± 0.03 times faster than single-pack reuse > > (There are corresponding tests in p5332 that test different sized chunks > and measure the runtime performance as well as resulting pack size). > > Performing verbatim pack reuse naturally trades off between CPU time and > the resulting pack size. In the above example, the single-pack reuse > case produces a clone size of ~194 MB on my machine, while the > multi-pack reuse case produces a clone size closer to ~266 MB, which is > a ~37% increase in clone size. Quite exciting, and a tradeoff that may be worth it for Git hosters. I expect that this is going to be an extreme example of the benefits provided by your patch series -- do you by any chance also have "real" numbers that make it possible to quantify the effect a bit better? No worry if you don't, I'm just curious. > I think there is still some opportunity to close this gap, since the > "packing" strategy here is extremely naive. In a production setting, I'm > sure that there are more well thought out repacking strategies that > would produce more similar clone sizes. > > I considered breaking this series up into smaller chunks, but was > unsatisfied with the result. Since this series is rather large, if you > have alternate suggestions on better ways to structure this, please let > me know. The series is indeed very involved to review. I only made it up to patch 8/24 and already spent quite some time on it. So I'd certainly welcome it if this was split up into smaller parts, but don't have a suggestion as to how this should be done (also because I didn't yet read the other 16 patches). I'll review the remaining patches at a later point in time. Patrick > Thanks in advance for your review! > > Taylor Blau (24): > pack-objects: free packing_data in more places > pack-bitmap-write: deep-clear the `bb_commit` slab > pack-bitmap: plug leak in find_objects() > midx: factor out `fill_pack_info()` > midx: implement `DISP` chunk > midx: implement `midx_locate_pack()` > midx: implement `--retain-disjoint` mode > pack-objects: implement `--ignore-disjoint` mode > repack: implement `--extend-disjoint` mode > pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions > pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature > pack-bitmap: return multiple packs via > `reuse_partial_packfile_from_bitmap()` > pack-objects: parameterize pack-reuse routines over a single pack > pack-objects: keep track of `pack_start` for each reuse pack > pack-objects: pass `bitmapped_pack`'s to pack-reuse functions > pack-objects: prepare `write_reused_pack()` for multi-pack reuse > pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack > reuse > pack-objects: include number of packs reused in output > pack-bitmap: prepare to mark objects from multiple packs for reuse > pack-objects: add tracing for various packfile metrics > t/test-lib-functions.sh: implement `test_trace2_data` helper > pack-objects: allow setting `pack.allowPackReuse` to "single" > pack-bitmap: reuse objects from all disjoint packs > t/perf: add performance tests for multi-pack reuse > > Documentation/config/pack.txt | 8 +- > Documentation/git-multi-pack-index.txt | 12 ++ > Documentation/git-pack-objects.txt | 8 + > Documentation/git-repack.txt | 12 ++ > Documentation/gitformat-pack.txt | 109 ++++++++++ > builtin/multi-pack-index.c | 13 +- > builtin/pack-objects.c | 200 +++++++++++++++---- > builtin/repack.c | 57 +++++- > midx.c | 218 +++++++++++++++++--- > midx.h | 11 +- > pack-bitmap-write.c | 9 +- > pack-bitmap.c | 265 ++++++++++++++++++++----- > pack-bitmap.h | 18 +- > pack-objects.c | 15 ++ > pack-objects.h | 1 + > t/helper/test-read-midx.c | 31 ++- > t/lib-disjoint.sh | 49 +++++ > t/perf/p5332-multi-pack-reuse.sh | 81 ++++++++ > t/t5319-multi-pack-index.sh | 140 +++++++++++++ > t/t5331-pack-objects-stdin.sh | 156 +++++++++++++++ > t/t5332-multi-pack-reuse.sh | 219 ++++++++++++++++++++ > t/t6113-rev-list-bitmap-filters.sh | 2 + > t/t7700-repack.sh | 4 +- > t/t7705-repack-extend-disjoint.sh | 142 +++++++++++++ > t/test-lib-functions.sh | 14 ++ > 25 files changed, 1650 insertions(+), 144 deletions(-) > create mode 100644 t/lib-disjoint.sh > create mode 100755 t/perf/p5332-multi-pack-reuse.sh > create mode 100755 t/t5332-multi-pack-reuse.sh > create mode 100755 t/t7705-repack-extend-disjoint.sh > > > base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d > -- > 2.43.0.24.g980b318f98
Attachment:
signature.asc
Description: PGP signature