Re: [PATCH] builtin/repack.c: invalidate MIDX only when necessary

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 25, 2020 at 11:14:41AM -0400, Derrick Stolee wrote:
> On 8/25/2020 10:41 AM, Taylor Blau wrote:
> > On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
> >> The code in builtin/repack.c looks good for sure. I have a quick question
> >> about this new test:
> >>
> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> +	git multi-pack-index write &&
> >> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >> +
> >> +	# Write a new pack that is unknown to the multi-pack-index.
> >> +	git hash-object -w </dev/null >blob &&
> >> +	git pack-objects $objdir/pack/pack <blob &&
> >> +
> >> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> +	test_cmp_bin $objdir/pack/multi-pack-index \
> >> +		$objdir/pack/multi-pack-index.bak
> >> +'
> >> +
> >>
> >> You create an arbitrary blob, and then add it to a pack-file. Do we
> >> know that 'git repack' is definitely creating a new pack-file that makes
> >> our manually-created pack-file redundant?
> >>
> >> My suggestion is to have the test check itself:
> >>
> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> +	git multi-pack-index write &&
> >> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >> +
> >> +	# Write a new pack that is unknown to the multi-pack-index.
> >> +	git hash-object -w </dev/null >blob &&
> >> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
> >> +
> >> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> +	test_cmp_bin $objdir/pack/multi-pack-index \
> >> +		$objdir/pack/multi-pack-index.bak &&
> >> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
> >> +'
> >> +
> >>
> >> This test fails for me, on the 'test_path_is_missing'. Likely, the
> >> blob is seen as already in a pack-file so is just pruned by 'git repack'
> >> instead. I thought that perhaps we need to add a new pack ourselves that
> >> overrides the small pack. Here is my attempt:
> >>
> >> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> 	git multi-pack-index write &&
> >> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >>
> >> 	# Write a new pack that is unknown to the multi-pack-index.
> >> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
> >> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
> >> 	cat >blobs <<-EOF &&
> >> 	$BLOB1
> >> 	$BLOB2
> >> 	EOF
> >> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
> >> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
> >> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> 	test_cmp_bin $objdir/pack/multi-pack-index \
> >> 		$objdir/pack/multi-pack-index.bak &&
> >> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
> >> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
> >> '
> >>
> >> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
> >> how to make sure your logic is tested. I saw that 'git repack' was writing
> >> "nothing new to pack" in the output, so I also tested adding a few commits and
> >> trying to force it to repack reachable data, but I cannot seem to trigger it
> >> to create a new pack that overrides only one pack that is not in the MIDX.
> >>
> >> Likely, I just don't know how 'git rebase' works well enough to trigger this
> >> behavior. But the test as-is is not testing what you want it to test.
> >
> > I think this case might actually be impossible to tickle in a test. I
> > thought that 'git repack -d' looked for existing packs whose objects are
> > a subset of some new pack generated. But, it's much simpler than that:
> > '-d' by itself just looks for packs that were already on disk with the
> > same SHA-1 as a new pack, and it removes the old one.
>
> If 'git repack' never calls remove_redundant_pack() unless we are doing
> a "full repack", then we could simplify this logic:
>
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +	strbuf_addf(&buf, "%s.pack", base_name);
> +	if (m && midx_contains_pack(m, buf.buf))
> +		clear_midx_file(the_repository);
> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>  	unlink_pack_path(buf.buf, 1);
>  	strbuf_release(&buf);
>  }
>
> to
>
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> 	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	clear_midx_file(the_repository);
>  	unlink_pack_path(buf.buf, 1);
>  	strbuf_release(&buf);
>  }
>
> and get the same results as we are showing in these tests. This does
> move us incrementally to a better situation: don't delete the MIDX
> if we aren't deleting pack files. But, I think we can get around it.

Makes sense, but reading your whole email we are better off leaving this
as-is and changing the test to exercise it more often.

> > Note that 'git repack' uses 'git pack-objects' internally to find
> > objects and generate a packfile. When calling 'git pack-objects', 'git
> > repack -d' passes '--all' and '--unpacked', which means that there is no
> > way we'd generate a new pack with the same SHA-1 as an existing pack
> > ordinarily.
> >
> > So, I think this case is impossible, or at least astronomically
> > unlikely. What is more interesting (and untested) is that adding a _new_
> > pack doesn't cause us to invalidate the MIDX. Here's a patch that does
> > that:
> >
> >   diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> >   index 16a1ad040e..620f2058d6 100755
> >   --- a/t/t5319-multi-pack-index.sh
> >   +++ b/t/t5319-multi-pack-index.sh
> >   @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
> >           test_path_is_missing $objdir/pack/multi-pack-index
> >    '
> >
> >   -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >   -       git multi-pack-index write &&
> >   -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >   -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >   -
> >   -       # Write a new pack that is unknown to the multi-pack-index.
> >   -       git hash-object -w </dev/null >blob &&
> >   -       git pack-objects $objdir/pack/pack <blob &&
> >   -
> >   -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >   -       test_cmp_bin $objdir/pack/multi-pack-index \
> >   -               $objdir/pack/multi-pack-index.bak
> >   +test_expect_success 'repack preserves multi-pack-index when creating packs' '
> >   +       git init preserve &&
> >   +       test_when_finished "rm -fr preserve" &&
> >   +       (
> >   +               cd preserve &&
> >   +               midx=.git/objects/pack/multi-pack-index &&
> >   +
> >   +               test_commit "initial" &&
> >   +               git repack -ad &&
> >   +               git multi-pack-index write &&
> >   +               ls .git/objects/pack | grep "\.pack$" >before &&
> >   +
> >   +               cp $midx $midx.bak &&
> >   +
> >   +               test_commit "another" &&
> >   +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >   +               ls .git/objects/pack | grep "\.pack$" >after &&
> >   +
> >   +               test_cmp_bin $midx.bak $midx &&
> >   +               ! test_cmp before after
> >   +       )
> >    '
>
> After looking at the callers to remote_redundant_pack() I noticed that it is only
> called after inspecting the "names" struct, which contains the names of the packs
> to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in
> the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and
> deleted. While this is very unlikely in the wild, it is definitely possible.

Great idea.

> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> 	git init preserve &&
> 	test_when_finished "rm -fr preserve" &&
> 	(
> 		cd preserve &&
> 		midx=.git/objects/pack/multi-pack-index &&
>
> 		test_commit 1 &&
> 		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
> 		touch .git/objects/pack/pack-$HASH1.keep &&
>
> 		cat >pack-input <<-\EOF &&

Escaping the heredoc shouldn't be necessary, so this can be written
instead as '<<-EOF'.

> 		HEAD
> 		^HEAD~1
> 		EOF
>
> 		test_commit 2 &&
> 		HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
> 		touch .git/objects/pack/pack-$HASH2.keep &&
>
> 		git multi-pack-index write &&
> 		cp $midx $midx.bak &&
>
> 		test_commit 3 &&
> 		HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
>
> 		test_commit 4 &&
> 		HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
>
> 		GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
> 		test_path_is_file .git/objects/pack/pack-$HASH1.pack &&
> 		test_path_is_file .git/objects/pack/pack-$HASH2.pack &&
> 		test_path_is_missing .git/objects/pack/pack-$HASH3.pack &&
> 		test_path_is_missing .git/objects/pack/pack-$HASH4.pack

...and we should check that 'test_cmp $midx.bak $midx' is clean, i.e.,
that we didn't touch the MIDX.

>        )
> '
>
> I believe this checks your condition properly enough.

Otherwise, I think that this test looks great. Thanks for suggesting
it. I'll send a new patch now...

> Thanks,
> -Stolee

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux