On Tue, Mar 19 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > I don't think the quarantine stuff should impact contention at all. It's
>> > only quarantining the objects, which are the least contentious part of
>> > Git (because object content is idempotent, so we don't do any locking
>> > there, and with two racing processes, one will just "win").
>>
>> Without the quarantine, isn't there the race that the NOTES section
>> talks about (unless I've misread it).
>
> Ah, OK, I wasn't quite sure which documentation you were talking about.
> I see the discussion now in the "NOTES" section of git-gc(1).
>
>> I.e. we have some loose object "ABCD" not referred to by anything for
>> the last 2 weeks, and as we're gc-ing a ref update comes in that makes it
>> referenced again. We then delete "ABCD" (not used!) at the same time the
>> ref update happens, and get corruption.
>>
>> Whereas the quarantine might work around this, since the client will have sent
>> ABCD with no reference pointing to it to the server in the temp pack,
>> which we then rename in-place and then update the ref, so we don't care
>> if "ABCD" goes away.
>
> tl;dr I don't think quarantine impacts this, but if you really want gory
> details, read on.
>
> This is a problem with or without the quarantine. It's fundamentally a
> race because we do not atomically say "is anybody using X? If not, we
> can delete it" and some other process saying "I'd like to use X".
>
> Pushes are actually better off than most operations, because we only
> advertise what's reachable, and the client is expected to send
> everything else. So with just a normal update-ref call, we could race
> like this:
>
> 1. ABCD is ancient.
>
> 2. Process 1 (update-ref) wants to reference ABCD. It sees that we
>    have it.
>
> 3. Process 2 (gc/prune) sees that nobody references it. It deletes
>    ABCD.
>
> 4. Process 1 writes out the reference.
>
> That doesn't happen with a push, because the server never would have
> told the client that it has ABCD in the first place (so process 1 here
> is the client). That is true with or without quarantine.
>
> But pushes aren't foolproof either. You said "loose object ABCD not
> referred to by anything for the last 2 weeks". But that's not exactly
> how it works. It's "object with an mtime of more than 2 weeks which is
> not currently referenced". So imagine a sequence like:
>
> 1. ABCD is ancient.
>
> 2. refs/heads/foo points to ABCD.
>
> 3. Server receive-pack advertises foo pointing to ABCD.
>
> 4. Simultaneous process on the server deletes refs/heads/foo (or
>    perhaps somebody force-pushes over it).
>
> 5. Client prepares and sends pack without ABCD.
>
> 6. Server receive-pack checks that yes, we still have ABCD (i.e., the
>    usual connectivity check).
>
> 7. Server gc drops ABCD, which is now unreachable (reflogs can help
>    here, if you've enabled them; but we do delete reflogs when the
>    branch is deleted).
>
> 8. Server receive-pack writes corrupt repo mentioning ABCD.
>
> That's a lot more steps, though they might not be as implausible as you
> think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a
> single push; that's actually a delete and an update, which is all you
> need to race with a simultaneous gc).
>
> I have no idea how often this happens in practice. My subjective
> recollection is that most of the race corruptions I've seen were from
> local operations on the server.
> E.g., we compute a tentative merge for
> somebody's pull request which shares objects with an older tentative
> merge. They click the "merge" button and we reference that commit, which
> is recent, but unbeknownst to us, while we were creating our new
> tentative merge, a "gc" was deleting the old one.
>
> We're sometimes saved by the "transitive freshness" rules in d3038d22f9
> (prune: keep objects reachable from recent objects, 2014-10-15). But
> they're far from perfect:
>
> - some operations (like the push rename example) aren't writing new
>   objects, so the ref write _is_ the moment that gc would find out
>   something is reachable
>
> - the "is it reachable?" and "no, then delete it" steps aren't atomic.
>   Unless you want a whole-repo stop-the-world lock, somebody can
>   reference the object in between. And since it may take many seconds
>   to compute reachability, stop-the-world is not great.
>
> I think there are probably ways to make it better. Perhaps some kind of
> lockless delete-but-be-able-to-rollback scheme (but keep in mind this
> has to be implemented on top of POSIX filesystem semantics). Or even
> just a "compute reachability, mark for deletion, and then hold a
> stop-the-world lock briefly to double-check that our reachability is
> still up to date".
>
> At least those seem plausible to me. I've never worked out the details,
> and our solution was to just stop deleting objects during routine
> maintenance (using "repack -adk"). We do still occasionally prune
> manually (e.g., when somebody writes to support to remove a confidential
> mistake).
>
> Anyway, that was more than you probably wanted to know. The short of it
> is that I don't think quarantines help (they may even make things worse
> by slightly increasing the length of the race window, though in practice
> I doubt it).
>
>> Unless that interacts racily with the receive.unpackLimit, but then I
>> have no idea what that section is trying to say...
>
> No, I don't think unpackLimit really affects it much either way.
>
>> Also, surely the part where "NOTES" says something to the effect of "you
>> are subject to races unless gc.auto=0" is wrong. To the extent that
>> there's races it won't matter that you invoke "git gc" or "git gc
>> --auto", it's the concurrency that matters. So if there's still races we
>> should be saying the repo needs to be locked for writes for the duration
>> of the "gc".
>
> Correct. It's the very act of pruning that is problematic. I think the
> point is that if you are manually running "git gc", you'd presumably do
> it at a time when the repository is not otherwise active.

Thanks for that e-mail. I'm hoping to get around to another set of "gc
doc" updates and will hopefully be able to steal liberally from it.

Maybe there's some case I haven't thought of that makes this stupid,
but I wonder if something like a "gc quarantine" might be a fix for
both of the issues you noted above. I.e. it seems to me that the main
issue is that we conflate "mtime 2 weeks old because it's been
unreferenced for 2 weeks" vs. "mtime 2 weeks old because we haven't
gotten around to a 'gc'".

So in such a "gc quarantine" mode, when we discover an object that's
unreachable (or a pack made up purely of unreachable objects), we'd
move the relevant loose object or "loose" pack to such a quarantine,
which would just be .git/unreferenced-objects/{??,pack}/ or whatever.

AFAICT both of the cases you mentioned above would be mitigated by
this, because we'd no longer conflate "haven't gc'd this yet and it's
2 weeks old" with "hasn't been referenced in 2 weeks".
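
In shell terms the quarantine step I have in mind would be something
like the sketch below. To be clear none of this exists today: the
quarantine path is made up, driving it via fsck is purely for
illustration, and I'm hand-waving away both the "transitive freshness"
rules and the question of how an object that becomes referenced again
gets pulled back out of the quarantine:

    # Step 1: instead of unlinking an unreachable loose object, park
    # it in the quarantine.
    quarantine=.git/unreferenced-objects

    git fsck --unreachable --no-reflogs |
    awk '/^unreachable/ { print $3 }' |
    while read -r oid
    do
        fanout=$(printf "%s" "$oid" | cut -c1-2)
        rest=$(printf "%s" "$oid" | cut -c3-)
        loose=.git/objects/$fanout/$rest

        # this sketch only handles loose objects, not packs
        test -f "$loose" || continue

        mkdir -p "$quarantine/$fanout" &&
        mv "$loose" "$quarantine/$fanout/$rest" &&
        # restart the clock: what we expire on below is time spent
        # known-unreferenced, i.e. time spent in the quarantine
        touch "$quarantine/$fanout/$rest"
    done

    # Step 2: a later gc deletes anything that has sat in the
    # quarantine (i.e. demonstrably unreferenced) for the whole grace
    # period.
    test -d "$quarantine" &&
    find "$quarantine" -type f -mtime +14 -delete

The point being that the mtime we'd check in step 2 is by construction
"time spent unreferenced", not "time since we happened to write the
object out".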
"hasn't been referenced in 2 weeks". I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1]. But of course without an accompanying quarantine described above doing that would just make this race condition worse. 1. https://public-inbox.org/git/87fu6bmr0j.fsf@xxxxxxxxxxxxxxxxxxx/