Re: [PATCH 00/24] pack-objects: multi-pack verbatim reuse

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 28, 2023 at 02:07:54PM -0500, Taylor Blau wrote:
> Back in fff42755ef (pack-bitmap: add support for bitmap indexes,
> 2013-12-21), we added support for reachability bitmaps, and taught
> pack-objects how to reuse verbatim chunks from the bitmapped pack. When
> multi-pack bitmaps were introduced, this pack-reuse mechanism evolved to
> use the MIDX's "preferred" pack as the source for verbatim reuse.
> 
> This allows repositories to incrementally repack themselves (e.g., using
> a `--geometric` repack), storing the result in a MIDX, and generating a
> corresponding bitmap. This keeps our bitmap coverage up-to-date, while
> maintaining a relatively small number of packs.
> 
> However, it is recommended (and matches what we do in production at
> GitHub) that repositories repack themselves all-into-one, and
> generate a corresponding single-pack reachability bitmap. This is done
> for a couple of reasons, but the most relevant one to this series is
> that it enables us to perform verbatim pack-reuse over a complete copy
> of the repository, since the entire repository resides in a single pack
> (and thus is eligible for verbatim pack-reuse).
> 
> As repositories grow larger, packing their contents into a single pack
> becomes less feasible. This series extends the pack-reuse mechanism to
> operate over multiple packs which are known ahead of time to be disjoint
> with respect to one another's set of objects.
> 
> The implementation has a few components:
> 
>   - A new MIDX chunk called "Disjoint packfiles" or DISP is introduced
>     to keep track of the bitmap position, number of objects, and
>     disjointed-ness for each pack contained in the MIDX.
> 
>   - A new mode for `git multi-pack-index write --stdin-packs` that
>     allows specifying disjoint packs, as well as a new option
>     `--retain-disjoint` which preserves the set of existing disjoint
>     packs in the new MIDX.
> 
>   - A new pack-objects mode `--ignore-disjoint`, which produces packs
>     which are disjoint with respect to the current set of disjoint packs
>     (i.e. it discards any objects from the packing list which appear in
>     any of the known-disjoint packs).
> 
>   - A new repack mode, `--extend-disjoint` which causes any new pack(s)
>     which are generated to be disjoint with respect to the set of packs
>     currently marked as disjoint, minus any pack(s) which are about to
>     be deleted.
> 
> With all of that in place, the patch series then rewrites all of the
> pack-reuse functions in terms of the new `bitmapped_pack` structure.
> Once we have dropped all of the assumptions stemming from only
> performing pack-reuse over a single candidate pack, we can then enable
> reuse over all of the disjoint packs.
> 
> In addition to the many new tests in t5332 added by that series, I tried
> to simulate a "real world" test on git.git by breaking the repository
> into chunks of 1,000 commits (plus their set of reachable objects not
> reachable from earlier chunk(s)) and packing those chunks. This produces
> a large number of packs with the objects from git.git which are known to
> be disjoint with respect to one another.
> 
>     $ git clone git@xxxxxxxxxx:git/git.git base
> 
>     $ cd base
>     $ mv .git/objects/pack/pack-*.idx{,.bak}
>     $ git unpack-objects <.git/objects/pack/pack-*.pack
> 
>     # pack the objects from each successive block of 1k commits
>     $ for rev in $(git rev-list --all | awk '(NR) % 1000 == 0' | tac)
>       do
>         echo $rev |
>         git.compile pack-objects --revs --unpacked .git/objects/pack/pack || return 1
>       done
>     # and grab any stragglers, pruning the unpacked objects
>     $ git repack -d
>     I then constructed a MIDX and corresponding bitmap
> 
>     $ find_pack () {
>         for idx in .git/objects/pack/pack-*.idx
>         do
>           git show-index <$idx | grep -q "$1" && basename $idx
>         done
>       }
>     $ preferred="$(find_pack $(git rev-parse HEAD))"
> 
>     $ ( cd .git/objects/pack && ls -1 *.idx ) | sed -e 's/^/+/g' |
>         git.compile multi-pack-index write --bitmap --stdin-packs \
>           --preferred-pack=$preferred
>     $ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
> 
> With all of that in place, I was able to produce a significant speed-up
> by reusing objects from multiple packs:
> 
>     $ hyperfine -L v single,multi -n '{v}-pack reuse' 'git.compile -c pack.allowPackReuse={v} pack-objects --revs --stdout --use-bitmap-index --delta-base-offset <in >/dev/null'
>     Benchmark 1: single-pack reuse
>       Time (mean ± σ):      6.094 s ±  0.023 s    [User: 43.723 s, System: 0.358 s]
>       Range (min … max):    6.063 s …  6.126 s    10 runs
> 
>     Benchmark 2: multi-pack reuse
>       Time (mean ± σ):     906.5 ms ±   3.2 ms    [User: 1081.5 ms, System: 30.9 ms]
>       Range (min … max):   903.5 ms … 912.7 ms    10 runs
> 
>     Summary
>       multi-pack reuse ran
>         6.72 ± 0.03 times faster than single-pack reuse
> 
> (There are corresponding tests in p5332 that test different sized chunks
> and measure the runtime performance as well as resulting pack size).
> 
> Performing verbatim pack reuse naturally trades off between CPU time and
> the resulting pack size. In the above example, the single-pack reuse
> case produces a clone size of ~194 MB on my machine, while the
> multi-pack reuse case produces a clone size closer to ~266 MB, which is
> a ~37% increase in clone size.

Quite exciting, and a tradeoff that may be worth it for Git hosters. I
expect that this is going to be an extreme example of the benefits
provided by your patch series -- do you by any chance also have "real"
numbers that make it possible to quantify the effect a bit better?

No worry if you don't, I'm just curious.

> I think there is still some opportunity to close this gap, since the
> "packing" strategy here is extremely naive. In a production setting, I'm
> sure that there are more well thought out repacking strategies that
> would produce more similar clone sizes.
> 
> I considered breaking this series up into smaller chunks, but was
> unsatisfied with the result. Since this series is rather large, if you
> have alternate suggestions on better ways to structure this, please let
> me know.

The series is indeed very involved to review. I only made it up to patch
8/24 and already spent quite some time on it. So I'd certainly welcome
it if this was split up into smaller parts, but don't have a suggestion
as to how this should be done (also because I didn't yet read the other
16 patches).

I'll review the remaining patches at a later point in time.

Patrick

> Thanks in advance for your review!
> 
> Taylor Blau (24):
>   pack-objects: free packing_data in more places
>   pack-bitmap-write: deep-clear the `bb_commit` slab
>   pack-bitmap: plug leak in find_objects()
>   midx: factor out `fill_pack_info()`
>   midx: implement `DISP` chunk
>   midx: implement `midx_locate_pack()`
>   midx: implement `--retain-disjoint` mode
>   pack-objects: implement `--ignore-disjoint` mode
>   repack: implement `--extend-disjoint` mode
>   pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
>   pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
>   pack-bitmap: return multiple packs via
>     `reuse_partial_packfile_from_bitmap()`
>   pack-objects: parameterize pack-reuse routines over a single pack
>   pack-objects: keep track of `pack_start` for each reuse pack
>   pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
>   pack-objects: prepare `write_reused_pack()` for multi-pack reuse
>   pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack
>     reuse
>   pack-objects: include number of packs reused in output
>   pack-bitmap: prepare to mark objects from multiple packs for reuse
>   pack-objects: add tracing for various packfile metrics
>   t/test-lib-functions.sh: implement `test_trace2_data` helper
>   pack-objects: allow setting `pack.allowPackReuse` to "single"
>   pack-bitmap: reuse objects from all disjoint packs
>   t/perf: add performance tests for multi-pack reuse
> 
>  Documentation/config/pack.txt          |   8 +-
>  Documentation/git-multi-pack-index.txt |  12 ++
>  Documentation/git-pack-objects.txt     |   8 +
>  Documentation/git-repack.txt           |  12 ++
>  Documentation/gitformat-pack.txt       | 109 ++++++++++
>  builtin/multi-pack-index.c             |  13 +-
>  builtin/pack-objects.c                 | 200 +++++++++++++++----
>  builtin/repack.c                       |  57 +++++-
>  midx.c                                 | 218 +++++++++++++++++---
>  midx.h                                 |  11 +-
>  pack-bitmap-write.c                    |   9 +-
>  pack-bitmap.c                          | 265 ++++++++++++++++++++-----
>  pack-bitmap.h                          |  18 +-
>  pack-objects.c                         |  15 ++
>  pack-objects.h                         |   1 +
>  t/helper/test-read-midx.c              |  31 ++-
>  t/lib-disjoint.sh                      |  49 +++++
>  t/perf/p5332-multi-pack-reuse.sh       |  81 ++++++++
>  t/t5319-multi-pack-index.sh            | 140 +++++++++++++
>  t/t5331-pack-objects-stdin.sh          | 156 +++++++++++++++
>  t/t5332-multi-pack-reuse.sh            | 219 ++++++++++++++++++++
>  t/t6113-rev-list-bitmap-filters.sh     |   2 +
>  t/t7700-repack.sh                      |   4 +-
>  t/t7705-repack-extend-disjoint.sh      | 142 +++++++++++++
>  t/test-lib-functions.sh                |  14 ++
>  25 files changed, 1650 insertions(+), 144 deletions(-)
>  create mode 100644 t/lib-disjoint.sh
>  create mode 100755 t/perf/p5332-multi-pack-reuse.sh
>  create mode 100755 t/t5332-multi-pack-reuse.sh
>  create mode 100755 t/t7705-repack-extend-disjoint.sh
> 
> 
> base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d
> -- 
> 2.43.0.24.g980b318f98

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux