On Thu, Dec 08 2022, Jeff King wrote: > On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> Is this for a very large hosting site that's anywhere near GitHub, >> GitLab's etc. scale? >> >> A "git gc" on a "live" repo is always racy in theory, but the odds that >> you'll run into data corrupting trouble tends to approach zero as you >> increase the gc.pruneExpire setting, with the default 2 weeks being more >> than enough for even the most paranoid user. > > I'm a bit less optimistic than "tends to approach zero" here. The > objects themselves might be older, but they may become referenced or > unreferenced in an immediate racy way. E.g., intending to move branch Z > to branch A, a client asks to add A and remove Z. Since there is no > atomic view of the ref namespace, a simultaneous gc might see Z gone, > but A not yet existing (and of course it could also be _two_ clients, > etc). Yes, I'm really hand-waiving away the issue there for brevity. You've got a really good summary of why exactly that race happens somewhere in the list archive, which I'm not digging up now. But (and just for the general audience here, I know you know this) that race basically happens because "gc" concurrently decides that say a 2 week old blob containing "foo" isn't referenced, *and* we have a concurrent branch push that happens to reference a "foo" blob, along with a concurrent "gc". I have run into this once or twice in practice (with a very high volume in-house git server), and in those cases it was because person "B" doing the new push was using person's "A"'s work to push a new branch, after it had been ~gc.pruneExpire since the topic branch for "A" was simultaneously being deleted. In principle what I noted upthread is completely false, but I think in practice it's almost always true. I.e. users aren't pushing random blobs, and as time goes by the odds that the exact same content is re-pushed go down. It's also worth noting that some repositories are much more vulnerable to this than others. If you have predictable auto-generated content in your repo (think the package+version list some languages routinely carry in-tree) you're much more likely to run into this: Someone bumped that for a topic ~2 weeks ago, nobody else bothered on any of their branches, and then the "A"+"B" race described above happens. As some practical advice to those running "gc" on live repos: To easily mitigate this run expiry on the least busy hours of the day. Even for truly global development teams there's usually a big lull when it's high noon in the middle of the Pacific. >> The "cruft pack" facility does many different things, and my >> understanding of it is that GitHub's not using it only as an end-run >> around potential corruption issues, but that some not yet in tree >> patches on top of it allow more aggressive "gc" without the fear of >> corruption. > > I don't think cruft packs themselves help against corruption that much. > For many years, GitHub used "repack -k" to just never expire objects. > What cruft packs help with is: > > 1. They keep cruft objects out of the main pack, which reduces the > costs of lookups and bitmaps for the main pack. > > 2. When you _do_ choose to expire, you can do so without worrying > about accidentally exploding all of those old objects into loose > ones (which is not wrong from a correctness point of view, but can > have some amazingly bad performance characteristics). > > I think the bits you're thinking of on top are in v2.39. The "repack > --expire-to" option lets you write objects that _would_ be deleted into > a cruft pack, which can serve as a backup (but managing that is out of > scope for repack itself, so you have to roll your own strategy there). Yes, that's what I was referring to. I think I had feedback on that series saying that if held correctly this would also nicely solve that long-time race. Maybe I'm just misremembering, but I (mis?)recalled that Taylor indicated that it was being used like that at GitHub. Another thing that really helps to mitigate it is this never-in-tree patch of mine (which ran in busy production for years, and probably still): https://lore.kernel.org/git/20181028225023.26427-1-avarab@xxxxxxxxx/ It's sensitive to "transfer.unpackLimit" if it's going to help with that, and even if you always write duplicate content it won't fully help with the race, as "B" may have seen the old ref, and hence not sent the "foo" blob over (so the client would need to not have a copy of "A"'s about-to-be-deleted topic). All of which described a setup I ran it in, so *if* we ever ran into the race then most likely we'd just have written duplicate content in the incoming PACK, so we'd happily chug along.