Re: [PATCH 1/2] pack-objects: break out of want_object loop early

Jeff King <peff@xxxxxxxx> · Wed, 27 Jul 2016 17:13:38 -0400

On Tue, Jul 26, 2016 at 02:38:28PM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > I got side-tracked by adding a t/perf test to show off the improvement.
> > It's rather tricky to get right and takes a long time to run. I _think_
> > I have it now, but am waiting for results. :)
> 
> Well, then I'd stop here and wait for the reroll to requeue.

So I hoped to follow up with my little perf-testing patch last night,
but the rabbit hole goes deeper. :)

It turns out that I could not replicate my earlier results using a perf
script like the one below. The problem is that it heavily depends on the
ordering of your packs. The early-break does not do any good if you
mostly end up finding your objects in the final pack in the list. And
that is the exact situation my perf script creates: imagine you had a
big repository, then 1000 pushes came in, and then you tried to run "git
repack". We sort reverse-chronologically, so the big pack is at the end,
and most lookups have to go through the entire pack list to get there.

In the normal object-lookup paths, we cache the last_found_pack and
check it first, but this loop has no equivalent (and until now it
wouldn't have helped much, because we generally had to walk the whole
list anyway).

I knew this was a potential issue and even had experimented this back
when the original patches were written, but in my experiments then it
didn't help much. That makes sense, though; any case that _was_ improved
by the first two patches would not benefit from reordering the pack
lookups, because that meant it was finding the objects early in the
search (or else the first two patches would have helped not at all).

And I think the particular case I was experimenting on back then was not
a normal "oops, we had some pushes and forgot to run gc". It was more
bizarre, and had several packs that had most of the objects duplicated.

So I think this is an area worth looking into; a series of small
incremental packs on top of a large one is what I would expect to be
the most common case, and the first two patches don't handle it well.

I tried a hacky version of the last_found_pack trick, and it does cut
the counting phase in half for my perf test. But no patch yet. One is
just that doing it cleanly is a little tricky. But two is that I've
wondered if we can do even better with a most-recently-used cache
instead of the last_pack_found hack. So I'm trying to implement and
measure that (both for this loop, and to see if it does better in
find_pack_entry).

Perf script below is just for a sneak peek at what I'm trying to
measure.

-Peff

-- >8 --
#!/bin/sh

test_description='performance with large numbers of packs'
. ./perf-lib.sh

test_perf_large_repo

# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building in the next step much
# simpler.
test_expect_success 'simplify reachability' '
	tip=$(git rev-parse --verify HEAD) &&
	git for-each-ref --format="option no-deref%0adelete %(refname)" |
	git update-ref --stdin &&
	rm -rf .git/logs &&
	git update-ref refs/heads/master $tip &&
	git symbolic-ref HEAD refs/heads/master &&
	git repack -ad
'

# A real many-pack situation would probably come from having a lot of pushes
# over time. We don't know how big each push would be, but we can fake it by
# just walking the first-parent chain and having each commit be its own "push".
# This isn't _entirely_ accurate, as real pushes would have some duplicate
# objects due to thin-pack fixing, but it's a reasonable approximation.
#
# And then all of the rest of the objects can go in a single packfile that
# represents the state before any of those pushes (actually, we'll generate
# that first because in such a setup it would be the oldest pack, and we sort
# the packs by reverse mtime inside git).
#
# We prepare this in a staging area, because we need to install our baseline
# set of packs for each iteration of the perf test (which unfortunately counts
# against their times, but is a limitation of the perf framework).
test_expect_success 'create a large number of packs' '
	mkdir staging &&

	pushes() {
		git rev-list --first-parent -1000 HEAD
	} &&
	pack() {
		git pack-objects --revs --delta-base-offset staging/pack
	} &&

	bottom=$(pushes | tail -n 1) &&
	echo "$bottom^" | pack &&

	pushes |
	while read rev
	do
		printf "%s\n^%s^" $rev $rev | pack ||
			return 1
	done &&

	setup_many_packs () {
		rm -f .git/objects/pack/* &&
		cp staging/* .git/objects/pack/
	} &&
	setup_many_packs
'

test_perf 'rev-list' '
	git rev-list --objects --all
'

test_perf 'full repack' '
	setup_many_packs &&
	git repack -ad
'

test_done
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html