On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote: > Noticed this while working on a semi-related series in: > > https://lore.kernel.org/git/cover.1663638929.git.me@xxxxxxxxxxxx/T/ > > the savings here are pretty modest, but this is in line with the > strategy we use in the `--geometric` repack mode, which performs a > similar task. To expand on my setup a little more, I ran the following script: --- >8 --- #!/bin/sh repack_into_n () { rm -rf staging && mkdir staging && git rev-list --first-parent HEAD | perl -e ' my $n = shift; while (<>) { last unless @commits < $n; push @commits, $_ if $. % 5 == 1; } print reverse @commits; ' "$1" >pushes && # create base packfile base_pack=$( head -n 1 pushes | git pack-objects --delta-base-offset --revs staging/pack ) && export base_pack && # create an empty packfile empty_pack=$(git pack-objects staging/pack </dev/null) && export empty_pack && # and then incrementals between each pair of commits last= && while read rev do if test -n "$last"; then { echo "$rev" && echo "^$last" } | git pack-objects --delta-base-offset --revs \ staging/pack || return 1 fi last=$rev done <pushes && ( find staging -type f -name 'pack-*.pack' | xargs -n 1 basename | grep -v "$base_pack" && printf "^pack-%s.pack\n" $base_pack ) >stdin.packs # and install the whole thing rm -f .git/objects/pack/* && mv staging/* .git/objects/pack/ } # Pretend we just have a single branch and no reflogs, and that everything is # in objects/pack; that makes our fake pack-building via repack_into_n() # much simpler. simplify_reachability() { tip=$(git rev-parse --verify HEAD) && git for-each-ref --format="option no-deref%0adelete %(refname)" | git update-ref --stdin && rm -rf .git/logs && git update-ref refs/heads/master $tip && git symbolic-ref HEAD refs/heads/master } simplify_reachability for i in 100 1000 5000 do echo >&2 "==> $i pack(s)" repack_into_n $i rm .git/objects/pack/multi-pack-index find .git/objects/pack -type f | sort >before hyperfine -p './prepare.sh' \ 'git multi-pack-index repack --batch-size=1G && ./report.sh' \ 'git.compile multi-pack-index repack --batch-size=1G && ./report.sh' done --- 8< --- ...where `git.compile` is has this patch and `git` does not. The two other scripts (prepare.sh, and report.sh, respectively) look as follows: --- >8 --- #!/bin/sh find .git/objects/pack -type f | sort >after comm -13 before after | xargs rm -f rm -f .git/objects/pack/multi-pack-index git multi-pack-index write --- 8< --- ...and report.sh: --- >8 --- #!/bin/sh find .git/objects/pack -type f | sort >after for new in $(comm -13 before after) do echo "==> $new ($(wc -c <$new))" done echo "-------------" --- 8< --- In general, the timings on git.git packed into 100 packs look something like: Benchmark 1: git multi-pack-index repack --batch-size=1G Time (mean ± σ): 4.342 s ± 0.087 s [User: 12.864 s, System: 0.396 s] Range (min … max): 4.235 s … 4.517 s 10 runs Benchmark 2: git.compile multi-pack-index repack --batch-size=1G Time (mean ± σ): 7.016 s ± 0.119 s [User: 11.170 s, System: 0.469 s] Range (min … max): 6.858 s … 7.233 s 10 runs But if I rip out the traversal pass towards the end of `read_packs_list_from_stdin()` in `builtin/pack-objects.c`, the two timings are equal. So the slow-down here really is from the traversal pass. The savings are modest, probably because we're already starting with a pretty well packed baseline, since we're feeding objects in pack order. On average, I was able to see around a ~3.5% reduction in pack size or so. So, not amazing, but mirroring the behavior of `git repack --geometric=<n>` is worthwhile for all of the reasons that we do this there. I should also mention that this applies cleanly against `master`, and doesn't depend on or interact with my changes in the series above. Thanks, Taylor