On Thu, Dec 08 2022, Jeff King wrote:

> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Is this for a very large hosting site that's anywhere near GitHub,
>> GitLab's etc. scale?
>>
>> A "git gc" on a "live" repo is always racy in theory, but the odds that
>> you'll run into data corrupting trouble tends to approach zero as you
>> increase the gc.pruneExpire setting, with the default 2 weeks being more
>> than enough for even the most paranoid user.
>
> I'm a bit less optimistic than "tends to approach zero" here. The
> objects themselves might be older, but they may become referenced or
> unreferenced in an immediate racy way. E.g., intending to move branch Z
> to branch A, a client asks to add A and remove Z. Since there is no
> atomic view of the ref namespace, a simultaneous gc might see Z gone,
> but A not yet existing (and of course it could also be _two_ clients,
> etc).

Yes, I'm really hand-waving away the issue there for brevity. You've got
a really good summary of why exactly that race happens somewhere in the
list archive, which I'm not digging up now.

But (and just for the general audience here, I know you know this) that
race basically happens because a concurrent "gc" decides that, say, a
2-week-old blob containing "foo" isn't referenced, *and* a branch push
racing with it happens to reference a "foo" blob.

I have run into this once or twice in practice (with a very high volume
in-house git server), and in those cases it was because person "B" doing
the new push was using person "A"'s work to push a new branch roughly
gc.pruneExpire later, while the topic branch for "A" was simultaneously
being deleted.

In principle what I noted upthread is completely false, but I think in
practice it's almost always true. I.e. users aren't pushing random
blobs, and as time goes by the odds that the exact same content is
re-pushed go down.

It's also worth noting that some repositories are much more vulnerable
to this than others. If you have predictable auto-generated content in
your repo (think the package+version list some languages routinely carry
in-tree) you're much more likely to run into this: someone bumped that
file for a topic ~2 weeks ago, nobody else bothered to on any of their
branches, and then the "A"+"B" race described above happens.

As some practical advice to those running "gc" on live repos: an easy
way to mitigate this is to run expiry during the least busy hours of the
day. Even for truly global development teams there's usually a big lull
when it's high noon in the middle of the Pacific.
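Something along these lines in a crontab is enough (just a sketch, the
repository path, the exact hour and the extra-generous expiry are all
made up for illustration):

  # Run expiry daily at ~22:00 UTC, i.e. around noon in the middle of
  # the Pacific, with a longer-than-default prune window.
  0 22 * * *  cd /srv/git/example.git && git -c gc.pruneExpire=1.month.ago gc --quiet

The larger gc.pruneExpire is just the "increase the setting" point from
above; the off-peak scheduling is what makes it less likely that anyone
is pushing while the expiry runs.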
The "repack > --expire-to" option lets you write objects that _would_ be deleted into > a cruft pack, which can serve as a backup (but managing that is out of > scope for repack itself, so you have to roll your own strategy there). Yes, that's what I was referring to. I think I had feedback on that series saying that if held correctly this would also nicely solve that long-time race. Maybe I'm just misremembering, but I (mis?)recalled that Taylor indicated that it was being used like that at GitHub. Another thing that really helps to mitigate it is this never-in-tree patch of mine (which ran in busy production for years, and probably still): https://lore.kernel.org/git/20181028225023.26427-1-avarab@xxxxxxxxx/ It's sensitive to "transfer.unpackLimit" if it's going to help with that, and even if you always write duplicate content it won't fully help with the race, as "B" may have seen the old ref, and hence not sent the "foo" blob over (so the client would need to not have a copy of "A"'s about-to-be-deleted topic). All of which described a setup I ran it in, so *if* we ever ran into the race then most likely we'd just have written duplicate content in the incoming PACK, so we'd happily chug along.