Re: [PATCH] builtin/pack-objects.c: introduce `pack.extraCruftTips`

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Wed, 26 Apr 2023 06:52:26 -0400

On 4/25/2023 5:25 PM, Taylor Blau wrote:
> On Tue, Apr 25, 2023 at 03:42:51PM -0400, Derrick Stolee wrote:
>> On 4/20/2023 1:27 PM, Taylor Blau wrote:

>>> The implementation is straightforward: after determining the set of
>>> objects to pack into the cruft pack (either by calling
>>> `enumerate_cruft_objects()` or `enumerate_and_traverse_cruft_objects()`)
>>> we figure out if there are any program(s) we need to consult in order to
>>> determine if there are any objects which we need to "rescue". We then
>>> add those as tips to another reachability traversal, marking every
>>> object along the way as cruft (and thus retaining it in the cruft pack).
>>
>> I initially wondered why we would need a second walk, and I _think_
>> one reason is that we don't want to update the mtimes for these
>> objects as if they were reachable. The next point about the reachable
>> pack is also critical.
> 
> Their mtimes might be updated during that second walk, but the key point
> is that we want to rescue any objects that the extra tips might reach
> (but aren't listed themselves in the output of one of the hooks).
> 
> The idea there is similar to the rescuing pass we do for pruning GC's,
> where we'll save any objects which would have been pruned, but are
> reachable from another object which won't yet be pruned.
> 
>>> Instead, keep these unreachable objects in the cruft pack instead, to
>>> ensure that we can continue to have a pack containing just reachable
>>> objects, which is always safe to write a bitmap over.
>>
>> Makes sense. Also: don't update their mtime so they could be removed
>> immediately if the external pointers change.
> 
> I don't think I'm quite following why we would want to leave their
> mtimes alone here. I think we'd want their mtimes to be accurate as of
> the last time we updated the *.mtimes file so that if they (a) no longer
> appear in the output of one of the extra-cruft hooks, and (b) they have
> been written recently, that they would not be pruned immediately.

You corrected me that we _are_ updating mtimes, so my understanding was
incorrect and this follow-up is incorrect. Thanks.

Is it possible to demonstrate this mtime-updating in a test?

>>> +	if (progress)
>>> +		progress_state = start_progress(_("Enumerating extra cruft tips"), 0);
>>
>> Should we be including two progress counts? or would it be sufficient
>> to say "traversing extra cruft objects" and include both of these steps
>> in that one progress mode?
> 
> I may be missing something, but there are already two progress meters
> here: one (above) for enumerating the list of extra cruft tips, which
> increments each time we read a new line that refers to an extant object,
> and another (below) for the number of objects that we visit along the
> second walk.

I meant: why two counts? Should we instead only have one?

>>> +	if (prepare_revision_walk(&revs))
>>> +		die(_("revision walk setup failed"));
>>> +	if (progress)
>>> +		progress_state = start_progress(_("Traversing extra cruft objects"), 0);
>>> +	nr_seen = 0;
>>> +	traverse_commit_list(&revs, show_cruft_commit, show_cruft_object, NULL);
>>
>> Hm. I'm trying to look at these helpers and figure out what they
>> are doing to prevent these objects from being pruned. This could
>> use a comment or something, as I'm unable to grok it at present.
> 
> Very fair, the show_cruft_object() callback calls
> add_cruft_object_entry(), which adds the visited object to the cruft
> pack regardless of its mtime with respect to pruning.

Ah, thanks. That helps a lot.

> A comment here might look like:
> 
>     /* retain all objects reachable from extra tips via show_cruft_object() */
> 
> or something.

Sounds good.

>> Could we store this commit to make sure it is removed from the
>> repository later?
> 
> Possibly, though I think we already have good coverage of this in other
> tests. So I'd be equally happy to assert on the exact contents of the
> cruft pack (which wouldn't include the above commit), but I'm happy to
> assert on it more directly if you feel strongly.

This is a case of me thinking "can we test the test?" to make sure
the test is doing everything we think it is doing...

>>> +		git checkout --orphan old &&
>>> +		git rm -fr . &&
>>> +		test_commit --no-tag cruft.old &&
>>> +		cruft_old="$(git rev-parse HEAD)" &&
>>> +
>>> +		git checkout --orphan new &&
>>> +		git rm -fr . &&
>>> +		test_commit --no-tag cruft.new &&
>>> +		cruft_new="$(git rev-parse HEAD)" &&
>>> +
>>> +		git checkout main &&
>>> +		git branch -D old new &&
>>
>> Do you need to delete the 'discard' branch here?
> 
> Great catch. I didn't notice it here because even though it was
> reachable (and not mentioned as an extra tip), so didn't appear in the
> cruft pack.

...so we catch things like this automatically.

>>> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>>> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
>>> +		cut -d" " -f2 cruft | sort >cruft.actual &&
>>> +		test_cmp cruft.expect cruft.actual
>>
>> One thing that would be good, as a safety check, is to remove
>> the pack.extraCruftTips config and re-run the repack and be
>> sure that we are pruning the objects previously saved by the
>> hook.
> 
> Good call. I think this amounts to:
> 
> --- 8< ---
> diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
> index 1260e11d41..44852e6925 100755
> --- a/t/t5329-pack-objects-cruft.sh
> +++ b/t/t5329-pack-objects-cruft.sh
> @@ -773,7 +773,7 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
>  		cruft_new="$(git rev-parse HEAD)" &&
> 
>  		git checkout main &&
> -		git branch -D old new &&
> +		git branch -D discard old new &&
>  		git reflog expire --all --expire=all &&
> 
>  		# mark cruft.old with an mtime that is many minutes
> @@ -802,6 +802,19 @@ test_expect_success 'additional cruft tips may be specified via pack.extraCruftT
>  		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
>  		git show-index <${mtimes%.mtimes}.idx >cruft &&
>  		cut -d" " -f2 cruft | sort >cruft.actual &&
> +		test_cmp cruft.expect cruft.actual &&

If we store "discard=$(git rev-parse discard)" earlier, we can also
check

		test_must_fail git show $discard &&

> +		# Ensure that the "old" objects are removed after
> +		# dropping the pack.extraCruftTips hook.
> +		git config --unset pack.extraCruftTips &&
> +		git repack --cruft --cruft-expiration=now -d &&
> +
> +		mtimes="$(ls .git/objects/pack/pack-*.mtimes)" &&
> +		git show-index <${mtimes%.mtimes}.idx >cruft &&
> +		cut -d" " -f2 cruft | sort >cruft.actual &&
> +
> +		git rev-list --objects --no-object-names $cruft_new >cruft.raw &&

I like how $cruft_new is still kept because its mtime is still
"later than now".

> +		sort cruft.raw >cruft.expect &&
>  		test_cmp cruft.expect cruft.actual
>  	)
>  '
> --- >8 ---
> 
>> It would be helpful to include a test for the multi-value case where
>> multiple scripts are executed and maybe include an "exit 1" in some
>> of them?
> 
> Definitely, that is a great suggestion (especially because it would have
> caught the "if (!ret)" bug that you pointed out above).
> 
> Thanks for a thorough review. I'll wait for other feedback before
> re-rolling, or do so in a couple of days if I haven't heard anything.

Thanks,
-Stolee