Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:
> Noticed this while working on a semi-related series in:
>
>   https://lore.kernel.org/git/cover.1663638929.git.me@xxxxxxxxxxxx/T/
>
> the savings here are pretty modest, but this is in line with the
> strategy we use in the `--geometric` repack mode, which performs a
> similar task.

To expand on my setup a little more, I ran the following script:

--- >8 ---
#!/bin/sh

repack_into_n () {
	rm -rf staging &&
	mkdir staging &&

	git rev-list --first-parent HEAD |
	perl -e '
		my $n = shift;
		while (<>) {
			last unless @commits < $n;
			push @commits, $_ if $. % 5 == 1;
		}
		print reverse @commits;
	' "$1" >pushes &&

	# create base packfile
	base_pack=$(
		head -n 1 pushes |
		git pack-objects --delta-base-offset --revs staging/pack
	) &&
	export base_pack &&

	# create an empty packfile
	empty_pack=$(git pack-objects staging/pack </dev/null) &&
	export empty_pack &&

	# and then incrementals between each pair of commits
	last= &&
	while read rev
	do
		if test -n "$last"; then
			{
				echo "$rev" &&
				echo "^$last"
			} |
			git pack-objects --delta-base-offset --revs \
				staging/pack || return 1
		fi
		last=$rev
	done <pushes &&

	(
		find staging -type f -name 'pack-*.pack' |
			xargs -n 1 basename | grep -v "$base_pack" &&
		printf "^pack-%s.pack\n" $base_pack
	) >stdin.packs

	# and install the whole thing
	rm -f .git/objects/pack/* &&
	mv staging/* .git/objects/pack/
}

# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building via repack_into_n()
# much simpler.
simplify_reachability() {
	tip=$(git rev-parse --verify HEAD) &&
	git for-each-ref --format="option no-deref%0adelete %(refname)" |
	git update-ref --stdin &&
	rm -rf .git/logs &&
	git update-ref refs/heads/master $tip &&
	git symbolic-ref HEAD refs/heads/master
}

simplify_reachability

for i in 100 1000 5000
do
  echo >&2 "==> $i pack(s)"

  repack_into_n $i

  rm .git/objects/pack/multi-pack-index
  find .git/objects/pack -type f | sort >before

  hyperfine -p './prepare.sh' \
	  'git multi-pack-index repack --batch-size=1G && ./report.sh' \
	  'git.compile multi-pack-index repack --batch-size=1G && ./report.sh'
done
--- 8< ---

...where `git.compile` is has this patch and `git` does not. The two
other scripts (prepare.sh, and report.sh, respectively) look as follows:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
comm -13 before after | xargs rm -f
rm -f .git/objects/pack/multi-pack-index
git multi-pack-index write
--- 8< ---

...and report.sh:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
for new in $(comm -13 before after)
do
	echo "==> $new ($(wc -c <$new))"
done
echo "-------------"
--- 8< ---

In general, the timings on git.git packed into 100 packs look something
like:

  Benchmark 1: git multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      4.342 s ±  0.087 s    [User: 12.864 s, System: 0.396 s]
    Range (min … max):    4.235 s …  4.517 s    10 runs

  Benchmark 2: git.compile multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      7.016 s ±  0.119 s    [User: 11.170 s, System: 0.469 s]
    Range (min … max):    6.858 s …  7.233 s    10 runs

But if I rip out the traversal pass towards the end of
`read_packs_list_from_stdin()` in `builtin/pack-objects.c`, the two
timings are equal. So the slow-down here really is from the traversal
pass.

The savings are modest, probably because we're already starting with a
pretty well packed baseline, since we're feeding objects in pack order.
On average, I was able to see around a ~3.5% reduction in pack size or
so.

So, not amazing, but mirroring the behavior of `git repack
--geometric=<n>` is worthwhile for all of the reasons that we do this
there.

I should also mention that this applies cleanly against `master`, and
doesn't depend on or interact with my changes in the series above.

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux