Re: [PATCH] builtin/repack.c: prune unreachable objects with `--expire-to`

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Nov 30, 2024 at 07:01:03PM -0500, Taylor Blau wrote:

> When invoked with '--expire-to', 'git repack' will move unreachable
> objects beyond the grace period to a separate repository outside of the
> main object store.
> 
> Later on, 'git repack' will remove any existing packs which were made
> redundant by the 'repack' operation, before then pruning loose objects
> which were packed. Ordinarily, unreachable objects which have expired
> were already packed via some earlier 'repack' operation, and so are
> removed from the main repository in the first step.
> 
> But if a repository has unreachable objects which:
> 
>   - have an mtime earlier than the --cruft-expiration period,
>   - are loose, and
>   - have never been packed
> 
> Then we'll create a pack containing those objects to store in the
> repository specified by the '--expire-to' option, but never prune the
> loose copies of those objects from the main repository. That's because
> we don't have a pack in the main repository which contains those
> objects, so prune_packed_objects() skips over them.
> 
> (As an aside, for repositories that have a large number of unreachable
> objects which were never packed, and are old enough to be expired, this
> can be quite painful. That's because even though we expect the repack to
> prune those objects which were GC'd, we don't per the above).
> 
> Teach repack to add the repository specified by '--expire-to' as an
> alternate of the main object store so that 'prune_packed_objects()' can
> "see" the packed copy of those objects, and remove them appropriately.

I think the solution you came up with makes sense in the confines of
what "repack --cruft" and "--expire-to" do. But I can't help but feel
that we may have taken a wrong (or at least somewhat confusing) turn
before this.

That is, why are we expecting "repack" to prune those loose objects at
all? In a traditional "repack -ad" they would not be removed, because
they are not packed and not reachable. It is the responsibility of a
follow-up "git prune" to get rid of them, and that is run automatically
by "git gc".

When we added in cruft packs, now those objects are "packed" in the
sense that we are storing them in a cruft pack instead of loose. But
they are not really "packed" in the sense that the original "repack"
would do. They're not in a pack we intend to keep, and the storage in a
cruft pack is mostly an incidental optimization over the loose format.
But "repack" started deleting them (because it knew they were moved into
a cruft pack), rather than waiting for "git prune" to do so. Which is a
little weird, but OK.

But then --expire-to adds extra confusion because we are _really_ not
packing them in the repository now. They are going to some magical
out-of-repo spot. Which yes, happens to be a pack. But I think one could
argue that they are not packed in the repo at that point, and it the
responsibility of "git prune" to get rid of them.


Now all of that is fairly philosophical. Is there a real-world benefit
to trying to retain this more purist view-point? I don't know. Certainly
I can think of some downsides, one of which is that running a follow-up
git-prune requires computing the full reachability again. That's both
expensive and racy with respect to what we saved in the cruft pack.

So I think there's some argument there that the pruning _has_ to be part
of the repack process for atomicity, and therefore --expire-to has to do
the same.

> @@ -1553,6 +1553,21 @@ int cmd_repack(int argc,
>  							&existing);
>  		if (show_progress)
>  			opts |= PRUNE_PACKED_VERBOSE;
> +
> +		if (expire_to && *expire_to) {
> +			char *alt = dirname(xstrdup(expire_to));
> +			size_t len = strlen(alt);
> +
> +			if (strip_suffix(alt, "pack", &len) &&
> +			    is_dir_sep(alt[len - 1])) {
> +				alt[len - 1] = '\0';
> +
> +				add_to_alternates_memory(alt);
> +				reprepare_packed_git(the_repository);
> +			}
> +
> +			free(alt);
> +		}

OK, so we are adding the containing directory as an alternate. That
gives me two concerns:

  1. The expire-to string is something like "path/to/objects/pack/pack",
     and we'll have created "path/to/objects/pack/pack-<hash>.pack"
     Using dirname() strips that down to "path/to/objects/pack". OK. And
     then we manually strip "pack/" off the end, which we have to do to
     get the "base" objects/ directory.

     But what if the path given by the user via --expire-to doesn't look
     like an object directory? I.e., does not end in "pack/"? Then this
     feature would not work at all.

     Should we be mentioning this in the git-repack docs?

     As an aside, I think the current --expire-to docs are misleading.
     They say:

       --expire-to=<dir>
	   Write a cruft pack containing pruned objects (if any) to the
	   directory <dir>. [...]

     But that isn't right. It is not a <dir> but a <base-name> similar
     to the one that pack-objects takes. If you do --expire-to=some/dir,
     then you'll get some/dir-<hash>.pack.

  2. Since we're adding the whole directory, that reprepare_packed_git()
     will also find any existing packs that were sitting in the
     --expire-to directory. If you are repeatedly stuffing cruft packs
     into the same directory every time you repack, then you'll see
     those older packs, and we'll prune objects that they mention. I'm
     not sure it's too bad from a correctness perspective to allow that
     (and I think they'd have ended up in the most recent cruft pack
     anyway). But I wonder if the expense of checking those packs
     eventually adds up.

One thing that could perhaps deal with both issues is to add the single
cruft pack itself, rather than the surrounding directory. I.e., would it
work to add_packed_git() and install_packed_git() each cruft pack
(should just be one, but I think in theory there may be multiple if
maxPackSize is enabled)?

It's maybe a little weird to have an entry in packed_git that doesn't
come from an object database that we're using. But I don't think there
are any problems with that (I'd probably pass "0" for the "local" flag
;) ).

The only other gotcha I see is that you probably need to
reprepare_packed_git() afterwards to make sure the packed_git_mru list
is refreshed.

-Peff




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux