Re: Flurries of 'git reflog expire'

Bryan Turner <bturner@xxxxxxxxxxxxx> · Tue, 11 Jul 2017 00:35:50 -0700

On Mon, Jul 10, 2017 at 9:45 PM, Andreas Krey <a.krey@xxxxxx> wrote:
> On Thu, 06 Jul 2017 10:01:05 +0000, Bryan Turner wrote:
> ....
>> I also want to add that Bitbucket Server 5.x includes totally
>> rewritten GC handling. 5.0.x automatically disables auto GC in all
>> repositories and manages it explicitly, and 5.1.x fully removes use of
>> "git gc" in favor of running relevant plumbing commands directly.
>
> That's the part that irks me. This shouldn't be necessary - git itself
> should make sure auto GC isn't run in parallel. Now I probably can't
> evaluate whether a git upgrade would fix this, but given that you
> are going the do-gc-ourselves route I suppose it wouldn't.
>

I believe I've seen some commits on the mailing list that suggest "git
gc --auto" manages its concurrency better in newer versions than it
used to, but even then it can only manage its concurrency within a
single repository. For a hosting server with thousands, or tens of
thousands, of active repositories, there still wouldn't be any
protection against "git gc --auto" running concurrently in dozens of
them at the same time.
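
To make the cross-repository point concrete, here is a minimal sketch of a
server-wide cap on concurrent GC runs. It is an illustration only, not how
Bitbucket Server is implemented; the class name "GcLimiter" and the "gc_fn"
callback are hypothetical.

```python
import threading


class GcLimiter:
    """Caps how many repositories may run GC at once, server-wide (hypothetical)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, repo_path, gc_fn):
        """Run gc_fn(repo_path) if a slot is free; return False if the cap is reached."""
        if not self._slots.acquire(blocking=False):
            return False  # too many GCs in flight; the caller can retry later
        try:
            gc_fn(repo_path)
            return True
        finally:
            self._slots.release()
```

A caller that gets False back would typically requeue the repository rather
than block, so a burst of pushes cannot start dozens of repacks at once.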

But it's not only about concurrency. "git gc" (and by extension "git
gc --auto") is a general purpose tool, designed to generally do what
you need, and to mostly stay out of your way while it does it. I'd
hazard to say it's not really designed for managing heavily-trafficked
repositories on busy hosting services, though, and as a result, there
are things it can't do.

For example, I can configure auto GC to run based on how many loose
objects or packs I have, but there's no heuristic to make it repack
refs when I have a lot of loose ones, or configure it to _only_ pack
refs without repacking objects or pruning reflogs. There are knobs for
various things (like "gc.*.reflogExpire"), but those don't give
complete control. Even if I set "gc.reflogExpire=never", "git gc"
still forks "git reflog expire --all" (compared to
"gc.packRefs=false", which completely prevents forking "git
pack-refs").
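
As an illustration of the missing heuristic, a hosting layer could count
loose refs itself and decide when to pack them. The sketch below assumes
the classic files-based ref storage (loose refs as files under "refs/");
the function names and the threshold of 500 are made up for the example.

```python
import os


def count_loose_refs(git_dir):
    """Count loose ref files under <git_dir>/refs, ignoring lock files."""
    total = 0
    for _root, _dirs, files in os.walk(os.path.join(git_dir, "refs")):
        total += sum(1 for name in files if not name.endswith(".lock"))
    return total


def pack_refs_command(git_dir, threshold=500):
    """Return the argv for packing refs, or None if there are too few loose refs."""
    if count_loose_refs(git_dir) < threshold:
        return None
    # Same arguments "git gc" passes to pack-refs in the trace in this message.
    return ["git", "--git-dir", git_dir, "pack-refs", "--all", "--prune"]
```

The command is returned rather than executed so the caller can decide when
(and whether) to run it, independently of object repacking.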

A trace on "git gc" shows this:

$ GIT_TRACE=1 git gc
00:10:45.058066 git.c:437               trace: built-in: git 'gc'
00:10:45.067075 run-command.c:369       trace: run_command: 'pack-refs' '--all' '--prune'
00:10:45.077086 git.c:437               trace: built-in: git 'pack-refs' '--all' '--prune'
00:10:45.084098 run-command.c:369       trace: run_command: 'reflog' 'expire' '--all'
00:10:45.093102 git.c:437               trace: built-in: git 'reflog' 'expire' '--all'
00:10:45.097088 run-command.c:369       trace: run_command: 'repack' '-d' '-l' '-A' '--unpack-unreachable=2.weeks.ago'
00:10:45.106096 git.c:437               trace: built-in: git 'repack' '-d' '-l' '-A' '--unpack-unreachable=2.weeks.ago'
00:10:45.107098 run-command.c:369       trace: run_command: 'pack-objects' '--keep-true-parents' '--honor-pack-keep' '--non-empty' '--all' '--reflog' '--indexed-objects' '--unpack-unreachable=2.weeks.ago' '--local' '--delta-base-offset' 'objects/pack/.tmp-15212-pack'
00:10:45.127117 git.c:437               trace: built-in: git 'pack-objects' '--keep-true-parents' '--honor-pack-keep' '--non-empty' '--all' '--reflog' '--indexed-objects' '--unpack-unreachable=2.weeks.ago' '--local' '--delta-base-offset' 'objects/pack/.tmp-15212-pack'
Counting objects: 6, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 6 (delta 0)
00:10:45.173161 run-command.c:369       trace: run_command: 'prune' '--expire' '2.weeks.ago'
00:10:45.184171 git.c:437               trace: built-in: git 'prune' '--expire' '2.weeks.ago'
00:10:45.199202 run-command.c:369       trace: run_command: 'worktree' 'prune' '--expire' '3.months.ago'
00:10:45.208193 git.c:437               trace: built-in: git 'worktree' 'prune' '--expire' '3.months.ago'
00:10:45.212198 run-command.c:369       trace: run_command: 'rerere' 'gc'
00:10:45.221223 git.c:437               trace: built-in: git 'rerere' 'gc'

The bare repositories used by Bitbucket Server:
* Generally don't have reflogs enabled, and for the ones that do,
  "gc.*.reflogExpire" is set to "never"
* Never have worktrees, so there is nothing there to prune
* Never use rerere, so "rerere gc" has nothing to do
* Have pruning disabled if they've been forked, because alternates
  are used to manage disk space

That means that, of all the commands "git gc" runs under the covers,
only "pack-refs", "repack" and sometimes "prune" have any value.
"reflog expire --all" in particular is extremely likely to fail. Which
brings up another consideration.
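
A rough sketch of what "running only the useful plumbing" could look like
follows. The flags are copied from the trace in this message; the exact
commands Bitbucket Server runs are not stated here, so treat "gc_plan", its
"has_forks" parameter and the handling of forked repositories as
assumptions made for illustration.

```python
import subprocess


def gc_plan(git_dir, has_forks=False, expire="2.weeks.ago"):
    """Build the list of git commands worth running for one bare repository."""
    base = ["git", "--git-dir", git_dir]
    repack = base + ["repack", "-d", "-l", "-A"]
    if not has_forks:
        # Unreachable objects older than the cutoff are not loosened,
        # because the prune step below would delete them anyway.
        repack.append("--unpack-unreachable=" + expire)
    plan = [base + ["pack-refs", "--all", "--prune"], repack]
    if not has_forks:
        plan.append(base + ["prune", "--expire", expire])
    # Deliberately absent: "reflog expire", "worktree prune", "rerere gc".
    return plan


def run_plan(plan):
    """Run each command in order, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)
```

For a forked repository the sketch skips "prune" and leaves out the expiry
cutoff on "repack", so unreachable objects are loosened rather than dropped.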

"git gc --auto" has no sense of context, or adjacent behavior. Even if
it correctly guards against concurrency, it still doesn't know what
else is going on. Immediately after a push, Bitbucket Server has many
other housekeeping tasks it performs, especially around pull requests.
That means pull request refs are disproportionately likely to be
"moving" immediately after a push completes--exactly when "git gc
--auto" tries to run. (Which tends to be why "reflog expire --all"
fails, due to ref locking issues with pull request refs.) Bitbucket
Server, on the other hand, better understands the context GC is
running in. So it can defer GC processing for a period of time after a
push completes, to increase the likelihood that the repository is
"quiet" and GC can complete without issue.

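The "defer until quiet" idea can be sketched as a simple debounce: every
push moves the repository's GC due time further out. This is a sketch
under assumptions, not Bitbucket Server's scheduler; the class name and
the 300-second default are invented for the example.

```python
class QuietPeriodScheduler:
    """Defers GC for a repository until no push has arrived for quiet_seconds."""

    def __init__(self, quiet_seconds=300.0):
        self.quiet_seconds = quiet_seconds
        self._due = {}  # repository -> earliest time GC may start

    def on_push(self, repo, now):
        """Record a push; a later push moves the due time further out."""
        self._due[repo] = now + self.quiet_seconds

    def pop_due(self, now):
        """Return (and forget) the repositories whose quiet period has elapsed."""
        ready = sorted(r for r, due in self._due.items() if due <= now)
        for r in ready:
            del self._due[r]
        return ready
```

Times are passed in explicitly (for example from time.monotonic()) so the
logic is easy to test without sleeping.
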
Another limitation is that you can't configure "negative" heuristics,
like "Don't run GC more than once per day". If the thresholds behind
"git gc --auto" are exceeded, it'll run GC. Depending, for example, on how
rapidly a repository generates unreachable objects, it's entirely
possible to get to a point where "git gc --auto" wants to run after
every single push, sometimes for days in a row, while it waits for
objects to hit the prune threshold. By managing GC ourselves, we gain
the ability to enforce "cooldowns" to prevent continuous GC.
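
A cooldown of that kind could be as small as the following sketch. The
class name, the one-day default and the "wants_gc" flag (standing in for
whatever heuristics decide that GC is needed) are all hypothetical.

```python
class GcCooldown:
    """Refuses to start GC for a repository more often than min_interval_seconds."""

    def __init__(self, min_interval_seconds=86400.0):
        self.min_interval_seconds = min_interval_seconds
        self._last_run = {}  # repository -> time the last GC was started

    def should_run(self, repo, wants_gc, now):
        """Return True (and record the run) only if GC is wanted and not cooling down."""
        if not wants_gc:
            return False
        last = self._last_run.get(repo)
        if last is not None and now - last < self.min_interval_seconds:
            return False  # heuristics are exceeded, but GC ran too recently
        self._last_run[repo] = now
        return True
```

With this gate, a repository that exceeds its thresholds after every push
still gets at most one GC per interval.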

"git gc --auto" also has a tendency to run "attached" to the "git
receive-pack" process, which means both that pushing users can have
their local process "delayed" while it runs, and that they sometimes
get to see "scary" errors that they can't fix (or, often, understand).
Newer versions of Git have increased the likelihood that "git gc
--auto" will run detached, but that doesn't always happen. (Up to and
including 2.13.2, the "git config" documentation for "gc.autoDetach"
is qualified with "if the system supports it.") Managing GC in
Bitbucket Server guarantees that it's _always_ detached from user
processes.

That's a few of the reasons we've switched over. I'd imagine most
hosting providers take a similarly "hands on" approach to controlling
their GC. Beyond a certain scale, it seems almost unavoidable. Git
never has more than a repository-level view of the world; only the
hosting provider can see the big picture.

Best regards,
Bryan Turner

> ...
>> Upgrading to 5.x can be a bit of an undertaking, since the major
>> version brings API changes,
>
> The upgrade is on my todo list, but there are plugins that don't
> appear to be ready for 5.0, notable the jenkins one.
>
> Andreas
>
> --
> "Totally trivial. Famous last words."
> From: Linus Torvalds <torvalds@*.org>
> Date: Fri, 22 Jan 2010 07:29:21 -0800