On Wed, Jun 29, 2022 at 03:54:04PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
> > This series is an RFC for now since I'm interested in discussing whether
> > or not this is a feature that people would actually want to use or not.
> > But if it is, I'm happy to polish this up and turn it into a
> > non-RFC-quality series ;-).
> >
> > In the meantime, thanks for your review!
>
> Thanks for this patch set. I can see this being used by, say, someone
> who wants to preserve a repo that rewinds branches all the time (the
> refs would need to be backed-up separately, but at least this provides a
> way for objects to be stored efficiently, in that reachable objects are
> still stored in the main repo and unreachable objects are stored in the
> backup with no overlap between them).

Yes, definitely. If it helps, I can share a little bit about the
motivating use-case within GitHub. All objects from a fork network are
stored together in a repository that we call the network.git, with
individual forks keeping track of their own references. The network.git
repository can often grow quite large, and/or contain data that the
owner of an individual fork would like removed (e.g., they accidentally
pushed sensitive credentials, force-pushed over it, but would like the
now-unreachable objects to be removed).

We don't usually do pruning GC's except during manual intervention or
upon request through a support ticket. But when we do it is often
infeasible to lock the entire network's push traffic and reference
updates. So it is not an unheard of event to encounter the race that I
described above.

The idea is that, at least for non-sensitive pruning, we would move the
pruned objects to a separate repository and hold them there until we
could run `git fsck` on the repository after pruning and verify that the
repository is intact. If it is, then the expired.git repository can be
emptied, too, permanently removing the pruned objects.
If not, the expired.git repository then becomes a donor for the missing
objects, which are used to heal the corrupt main repository. Once *that*
is done, and fsck comes back clean, then the expired.git repository can
be removed.

> I think there is at least one more alternative that should be
> considered, though: since the cruft pack is unlikely to have its objects
> "resurrected" (since the reason why they're there is because they are
> unreachable), it is likely that the objects that are pruned are exactly
> the same as those in the cruft pack. So it would be more efficient to
> just unconditionally rename the cruft pack to the backup destination.

This isn't quite right. What gets written into the expired.git
repository is everything that *didn't* end up in the cruft pack.

Suppose your cruft expiration is 1.hour.ago, and you're doing a repack
on repository foo.git, expiring objects into expired.git. There are
three disjoint sets of objects:

  - reachable objects, which will stay in foo.git

  - unreachable objects which were written within the last hour (and are
    thus too new to prune), which will stay in foo.git

  - unreachable objects which *weren't* written within the last hour
    (and thus will be pruned), which are moved to a new pack in
    expired.git (and removed from foo.git)

So the cruft pack in foo.git and the one written to expired.git are a
disjoint cut of the unreachable objects in foo.git based on their mtime,
with the recent objects staying in the source repository and the stale
ones moving to the expired.git repository.

The original implementation of this feature was to move the entire cruft
pack out of the way like you describe. This is sub-optimal because you
are forced to generate that cruft pack with `--cruft-expiration=never`,
since you can't actually prune any objects when generating the cruft
pack, or they would be gone forever.
But since you have to move the entire cruft pack out of the way, the
visible effect looks like you actually pruned *all* unreachable objects,
as if you had supplied `--cruft-expiration=now`.

Being able to expire just the objects which have aged out of the grace
period should cause this race to happen less frequently in practice.

> Having said that, if there is a compelling use case for repacking even
> when we're moving from cruft pack to backup, the design of this patch
> set looks good overall. There are some minor points (e.g. the naming of
> the parameter "out" in patch 1), but I understand that this is an RFC
> and I'll wait for a non-RFC patch set before looking more closely at
> these things.

Thanks,
Taylor
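
P.S. In case a concrete picture helps, the three-way split described
above can be modelled in a few lines of Python. This is only a toy
sketch of the partitioning rule, not how the series implements it: the
`Obj` record, the `split_objects` helper and the `grace_seconds`
parameter are all made up for illustration, and how an object whose
mtime sits exactly on the cutoff is treated here is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    # Hypothetical stand-in for a Git object; not a real Git structure.
    oid: str
    reachable: bool
    mtime: float  # seconds since the epoch


def split_objects(objects, now, grace_seconds):
    """Partition objects into (reachable, recent cruft, expired).

    Every object lands in exactly one of the three lists, so the recent
    cruft (stays in foo.git) and the expired set (moves to expired.git)
    never overlap.
    """
    cutoff = now - grace_seconds
    reachable = [o for o in objects if o.reachable]
    cruft = [o for o in objects if not o.reachable and o.mtime >= cutoff]
    expired = [o for o in objects if not o.reachable and o.mtime < cutoff]
    return reachable, cruft, expired


now = 10_000.0
objects = [
    Obj("a", reachable=True, mtime=0.0),          # reachable, however old
    Obj("b", reachable=False, mtime=now - 60),    # unreachable, 1 minute old
    Obj("c", reachable=False, mtime=now - 7200),  # unreachable, 2 hours old
]

# One hour of grace, mirroring the 1.hour.ago example.
kept, cruft, expired = split_objects(objects, now, grace_seconds=3600)
```

With a one-hour grace period, "a" stays in foo.git as a reachable
object, "b" stays in foo.git's cruft pack, and only "c" would be moved
to expired.git.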