On Mon, Jul 16, 2018 at 12:52 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Mon, Jul 16, 2018 at 12:15:05PM -0700, Elijah Newren wrote:
>
>> The basic problem here, at least for us, is that gc has enough
>> information to know it could expunge some objects, but because of how
>> it is structured in terms of several substeps (reflog expiration,
>> repack, prune), the information is lost between the steps and it
>> instead writes them out as unreachable objects. If we could prune (or
>> avoid exploding) loose objects that are only reachable from reflog
>> entries that we are expiring, then the problem goes away for us. (I
>> totally understand that other repos may have enough unreachable
>> objects for other reasons that Peff's suggestion to just pack up
>> unreachable objects is still a really good idea. But on its own, it
>> seems like a waste since it's packing stuff that we know we could just
>> expunge.)
>
> No, we should have expunged everything that could be during the "repack"
> and "prune" steps. We feed the expiration time to repack, so that it
> knows to drop objects entirely instead of exploding them loose.

Um, except it doesn't actually do that. The testcase I provided shows
that it leaves around 10000 objects that are totally deletable and
were only previously referenced by reflog entries -- entries that gc
removed without removing the corresponding objects.

I will note that my testcase was slightly out-of-date; with current
git it needs a call to 'wait_for_background_gc_to_finish' right before
the 'git gc --quiet' to avoid erroring out.

> You could literally just do:
>
>   find .git/objects/?? -type f |
>   perl -lne 's{../.{38}$} and print "$1$2"' |
>   git pack-objects .git/objects/pack/cruft-pack
>
> But:
>
>   - that will explode them out only to repack them, which is inefficient
>     (if they're already packed, you can probably reuse deltas, not to
>     mention the I/O savings)
>
>   - there's the question of how to handle timestamps. Some of those
>     objects may have been _about_ to expire, but now you've just put
>     them in a brand-new pack that adds another 2 weeks to their life
>
>   - the find above is sloppy, and will race with somebody adding new
>     objects to the repo
>
> So probably you want to have pack-objects write out the list of objects
> it _would_ explode, rather than exploding them. And then before
> git-repack deletes the old packs, put those into a new cruft pack. That
> _just_ leaves the timestamp issue (which is discussed at length in the
> thread I linked earlier).
>
>> git_actual_garbage_collect() {
>>     GITDIR=$(git rev-parse --git-dir)
>>
>>     # Record all revisions stored in reflog before and after gc
>>     git rev-list --no-walk --reflog >$GITDIR/gc.original-refs
>>     git gc --auto
>>     wait_for_background_gc_to_finish
>>     git rev-list --no-walk --reflog >$GITDIR/gc.final-refs
>>
>>     # Find out which reflog entries were removed
>>     DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs))
>
> This is too detailed, I think. There are other reasons to have
> unreachable objects than expired reflogs. I think you really just want
> to consider all unreachable objects (like the pack-objects thing I
> mentioned above).

Yes, like I said, coarse workaround and I never had time to create a
real fix. But I thought the testcase might be useful as a
demonstration of how git gc leaves around loose objects that were
previously referenced by reflogs that gc itself pruned.
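
For anyone who wants to check the leftover loose objects without the full testcase, here is a minimal sketch. It only reads the "count:" line that 'git count-objects -v' prints; the helper names loose_count and report_loose are made up for illustration and are not part of the testcase or of git.

```shell
#!/bin/sh

# Hypothetical helper (name is an assumption): read `git count-objects -v`
# output on stdin and print the value of its "count:" line, i.e. the
# number of loose objects.
loose_count() {
    awk '$1 == "count:" { print $2 }'
}

# Hypothetical helper: print the loose-object count for the repository
# in the current directory.
report_loose() {
    git count-objects -v | loose_count
}
```

Running report_loose before and after 'git gc --quiet' in the testcase repository should show whether the count actually drops; if the behavior described above holds, it would stay roughly where it was.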