On Mon, Jul 16, 2018 at 10:27 AM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote: > In a087cc9819 ("git-gc --auto: protect ourselves from accumulated > cruft", 2007-09-17), the user was warned if there were too many > unreachable loose objects. This made sense at the time, because gc > couldn't prune them safely. But subsequently, git prune learned the > ability to not prune recently created loose objects, making pruning able > to be done more safely, and gc was made to automatically prune old > unreachable loose objects in 25ee9731c1 ("gc: call "prune --expire > 2.weeks.ago" by default", 2008-03-12). ... > > --- > This was noticed when a daemonized gc run wrote this warning to the log > file, and returned 0; but a subsequent run merely read the log file, saw > that it is non-empty and returned -1 (which is inconsistent in that such > a run should return 0, as it did the first time). Yeah, we've hit this several times too. I even created a testcase and a workaround, though I never got it into proper submission form. The basic problem here, at least for us, is that gc has enough information to know it could expunge some objects, but because of how it is structured in terms of several substeps (reflog expiration, repack, prune), the information is lost between the steps and it instead writes them out as unreachable objects. If we could prune (or avoid exploding) loose objects that are only reachable from reflog entries that we are expiring, then the problem goes away for us. (I totally understand that other repos may have enough unreachable objects for other reasons that Peff's suggestion to just pack up unreachable objects is still a really good idea. But on its own, it seems like a waste since it's packing stuff that we know we could just expunge.) Anyway, my very rough testcase is below. The workaround is the git_actual_garbage_collect() function (minus the call to wait_for_background_gc_to_finish). Elijah --- wait_for_background_gc_to_finish() { while ( ps -ef | grep -v grep | grep --quiet git.gc.--auto ); do sleep 1; done } git_standard_garbage_collect() { # Current git gc sprays unreachable objects back in loose form; this is # fine in many cases, but is annoying when done with objects which # newly become unreachable because of something else git-gc does and # git-gc doesn't clean them up. git gc --auto wait_for_background_gc_to_finish } git_actual_garbage_collect() { GITDIR=$(git rev-parse --git-dir) # Record all revisions stored in reflog before and after gc git rev-list --no-walk --reflog >$GITDIR/gc.original-refs git gc --auto wait_for_background_gc_to_finish git rev-list --no-walk --reflog >$GITDIR/gc.final-refs # Find out which reflog entries were removed DELETED_REFS=$(comm -23 <(sort $GITDIR/gc.original-refs) <(sort $GITDIR/gc.final-refs)) # Get the list of objects which used to be reachable, but were made # unreachable due to gc's reflog expiration. To get these, I need # the intersection of things reachable from $DELETED_REFS and things # which are unreachable now. git rev-list --objects $DELETED_REFS --not --all --reflog | awk '{print $1}' >$GITDIR/gc.previously-reachable-objects git prune --expire=now --dry-run | awk '{print $1}' > $GITDIR/gc.unreachable-objects # Delete all the previously-reachable-objects made unreachable by the # reflog expiration done by git gc comm -12 <(sort $GITDIR/gc.unreachable-objects) <(sort $GITDIR/gc.previously-reachable-objects) | sed -e "s#^\(..\)#$GITDIR/objects/\1/#" | xargs rm } test -d testcase && { echo "testcase exists; exiting"; exit 1; } git init testcase/ cd testcase # Create a basic commit echo hello >world git add world git commit -q -m "Initial" # Create a commit with lots of files for i in {0000..9999}; do echo $i >$i; done git add [0-9]* git commit --quiet -m "Lots of numbers" # Pack it all up git gc --quiet # Stop referencing the garbage git reset --quiet --hard HEAD~1 # Pretend we did all the above stuff 30 days ago for rlog in $(find .git/logs -type f); do # Subtract 3E6 (just over 30 days) from every date (assuming dates have # exactly 10 digits, which just happens to be valid...right now at least) perl -i -ne '/(.*?)(\b[0-9]{10}\b)(.*)/ && print $1,$2-3000000,$3,"\n"' $rlog done # HOWEVER: note that the pack is new; if we make the pack old, the old objects # will get pruned for us. But it is quite common to have new packfiles with # old-and-soon-to-be-unreferenced-objects because frequent gc's mean moving # the objects to new packfiles often, and eventually the reflog is expired. # If you want to test them being part of an old packfile, uncomment this: # find .git/objects/pack -type f | xargs touch -t 200001010101 # Create 50 packfiles in the current repo so that 'git gc --auto' will # trigger `git repack -A -d -l` instead of just `git repack -d -l` for i in {01..50}; do git fast-export master | sed -e s/Initial/Initi$i/ | git -c fastimport.unpacklimit=0 fast-import --quiet |& grep -v "Not updating refs/heads/master" done echo "*** Before gc, reflog refers to garbage-collectible commits: ***" git rev-list --no-walk --all --reflog cat .git/logs/refs/heads/master echo "*** Before gc, everything is packed with no loose objects: ***" git count-objects -v git_standard_garbage_collect # Just `git gc --auto` #git_actual_garbage_collect # What I really want from `git gc --auto` echo -e "\n\n*** After gc, commit garbage collected and objects made loose: ***" git rev-list --no-walk --all --reflog cat .git/logs/refs/heads/master git count-objects -v echo -e "\n\n*** Now we can trigger the 'too many unreachable' error: ***" git gc --auto