On Thu, Dec 08, 2022 at 01:35:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> The "cruft pack" facility does many different things, and my
> >> understanding of it is that GitHub's not using it only as an end-run
> >> around potential corruption issues, but that some not yet in tree
> >> patches on top of it allow more aggressive "gc" without the fear of
> >> corruption.
> >
> > I don't think cruft packs themselves help against corruption that much.
> > For many years, GitHub used "repack -k" to just never expire objects.
> > What cruft packs help with is:
> >
> >   1. They keep cruft objects out of the main pack, which reduces the
> >      costs of lookups and bitmaps for the main pack.

Peff isn't wrong here, but there is a big caveat which is that this is
only true when using a single pack bitmap. Single pack bitmaps are
guaranteed to have reachability closure over their objects, but writing
a MIDX bitmap after generating the MIDX does not afford us the same
guarantees.

So if you have a cruft pack which contains some unreachable object X,
which is made reachable by some other object that *is* reachable from
some reference, *and that* object is included in one of the MIDX's
packs, then we won't have reachability closure unless we also bitmap the
cruft pack, too.

So even though it helps a lot with bitmapping in the single-pack case,
in practice it doesn't make a significant difference with multi-pack
bitmaps.

> >   2. When you _do_ choose to expire, you can do so without worrying
> >      about accidentally exploding all of those old objects into loose
> >      ones (which is not wrong from a correctness point of view, but can
> >      have some amazingly bad performance characteristics).
> >
> > I think the bits you're thinking of on top are in v2.39. The "repack
> > --expire-to" option lets you write objects that _would_ be deleted into
> > a cruft pack, which can serve as a backup (but managing that is out of
> > scope for repack itself, so you have to roll your own strategy there).
>
> Yes, that's what I was referring to.

Yes, we use the `--expire-to` option when doing a pruning GC to move the
expired objects out of the repo to some "../backup.git" location.

The out-of-tree tooling that Ævar is speculating about basically runs
`cat-file --batch` in the backup repo, feeding it the list of missing
objects, and then writes those objects (back) into the GC'd repository.

> I think I had feedback on that series saying that if held correctly this
> would also nicely solve that long-time race. Maybe I'm just
> misremembering, but I (mis?)recalled that Taylor indicated that it was
> being used like that at GitHub.

It (the above) doesn't solve the race, but it does make it easier to
recover from a corrupt repository when we lose that race.

Thanks,
Taylor
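
P.S. For illustration only, the recovery step described above could be
sketched roughly as below. This is a minimal sketch under stated
assumptions, not the actual out-of-tree tooling: the function name
`restore_missing`, its arguments, and the format of the missing-objects
list (one object ID per line) are all hypothetical. For simplicity it
uses `cat-file --batch-check` plus one `cat-file`/`hash-object` pair per
object rather than parsing a single `cat-file --batch` stream.

```shell
# restore_missing REPO BACKUP LIST  (hypothetical helper, for illustration)
#   REPO:   the GC'd repository that lost objects
#   BACKUP: the repository holding the expired objects
#   LIST:   file with one missing object ID per line (assumed format)
restore_missing () {
	repo=$1 backup=$2 missing=$3

	# Ask the backup repo for "<oid> <type>" of every listed object.
	git -C "$backup" cat-file \
		--batch-check='%(objectname) %(objecttype)' <"$missing" |
	while read -r oid type
	do
		# Objects the backup does not have are reported as "<oid> missing".
		if test "$type" = missing
		then
			echo "not in backup: $oid" >&2
			continue
		fi
		# Copy the raw object contents from the backup and write them
		# into the GC'd repository as an object of the same type.
		git -C "$backup" cat-file "$type" "$oid" |
		git -C "$repo" hash-object -w -t "$type" --stdin >/dev/null
	done
}
```

Called as, say, `restore_missing repo.git ../backup.git missing-oids.txt`
(all three paths being placeholders), this writes each recovered object
back as a loose object. One process pair per object is slow for large
lists, so anything serious would want to stream a single
`cat-file --batch` instead.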