On Tue, Mar 19 2019, Jeff King wrote:

> On Mon, Mar 18, 2019 at 11:45:39PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > I don't think the quarantine stuff should impact contention at all. It's
>> > only quarantining the objects, which are the least contentious part of
>> > Git (because object content is idempotent, so we don't do any locking
>> > there, and with two racing processes, one will just "win").
>>
>> Without the quarantine, isn't there the race that the NOTES section
>> talks about (unless I've misread it).
>
> Ah, OK, I wasn't quite sure which documentation you were talking about.
> I see the discussion now in the "NOTES" section of git-gc(1).
>
>> I.e. we have some loose object "ABCD" not referred to by anything for
>> the last 2 weeks, and as we're gc-ing a ref update comes in that makes it
>> referenced again. We then delete "ABCD" (not used!) at the same time the
>> ref update happens, and get corruption.
>>
>> Whereas the quarantine might work around this, since the client will have sent
>> ABCD with no reference pointing to it to the server in the temp pack,
>> which we then rename in-place and then update the ref, so we don't care
>> if "ABCD" goes away.
>
> tl;dr I don't think quarantine impacts this, but if you really want gory
> details, read on.
>
> This is a problem with or without the quarantine. It's fundamentally a
> race because we do not atomically say "is anybody using X? If not, we
> can delete it" and some other process saying "I'd like to use X".
>
> Pushes are actually better off than most operations, because we only
> advertise what's reachable, and the client is expected to send
> everything else. So with just a normal update-ref call, we could race
> like this:
>
> 1. ABCD is ancient.
>
> 2. Process 1 (update-ref) wants to reference ABCD. It sees that we
>    have it.
>
> 3. Process 2 (gc/prune) sees that nobody references it. It deletes
>    ABCD.
>
> 4. Process 1 writes out the reference.
>
> That doesn't happen with a push, because the server never would have
> told the client that it has ABCD in the first place (so process 1 here
> is the client). That is true with or without quarantine.
>
> But pushes aren't foolproof either. You said "loose object ABCD not
> referred to by anything for the last 2 weeks". But that's not exactly
> how it works. It's "object with an mtime of more than 2 weeks which is
> not currently referenced". So imagine a sequence like:
>
> 1. ABCD is ancient.
>
> 2. refs/heads/foo points to ABCD.
>
> 3. Server receive-pack advertises foo pointing to ABCD.
>
> 4. Simultaneous process on the server deletes refs/heads/foo (or
>    perhaps somebody force-pushes over it).
>
> 5. Client prepares and sends pack without ABCD.
>
> 6. Server receive-pack checks that yes, we still have ABCD (i.e., the
>    usual connectivity check).
>
> 7. Server gc drops ABCD, which is now unreachable (reflogs can help
>    here, if you've enabled them; but we do delete reflogs when the
>    branch is deleted).
>
> 8. Server receive-pack writes corrupt repo mentioning ABCD.
>
> That's a lot more steps, though they might not be as implausible as you
> think (e.g., consider moving "refs/heads/foo" to "refs/heads/bar" in a
> single push; that's actually a delete and an update, which is all you
> need to race with a simultaneous gc).
>
> I have no idea how often this happens in practice. My subjective
> recollection is that most of the race corruptions I've seen were from
> local operations on the server.
> E.g., we compute a tentative merge for
> somebody's pull request which shares objects with an older tentative
> merge. They click the "merge" button and we reference that commit, which
> is recent, but unbeknownst to us, while we were creating our new
> tentative merge, a "gc" was deleting the old one.
>
> We're sometimes saved by the "transitive freshness" rules in d3038d22f9
> (prune: keep objects reachable from recent objects, 2014-10-15). But
> they're far from perfect:
>
> - some operations (like the push rename example) aren't writing new
>   objects, so the ref write _is_ the moment that gc would find out
>   something is reachable
>
> - the "is it reachable?" and "no, then delete it" steps aren't atomic.
>   Unless you want a whole-repo stop-the-world lock, somebody can
>   reference the object in between. And since it may take many seconds
>   to compute reachability, stop-the-world is not great.
>
> I think there are probably ways to make it better. Perhaps some kind of
> lockless delete-but-be-able-to-rollback scheme (but keep in mind this
> has to be implemented on top of POSIX filesystem semantics). Or even
> just a "compute reachability, mark for deletion, and then hold a
> stop-the-world lock briefly to double-check that our reachability is
> still up to date".
>
> At least those seem plausible to me. I've never worked out the details,
> and our solution was to just stop deleting objects during routine
> maintenance (using "repack -adk"). We do still occasionally prune
> manually (e.g., when somebody writes to support to remove a confidential
> mistake).
>
> Anyway, that was more than you probably wanted to know. The short of it
> is that I don't think quarantines help (they may even make things worse
> by slightly increasing the length of the race window, though in practice
> I doubt it).
>
>> Unless that interacts racily with the receive.unpackLimit, but then I
>> have no idea what that section is trying to say...
>
> No, I don't think unpackLimit really affects it much either way.
>
>> Also, surely the part where "NOTES" says something to the effect of "you
>> are subject to races unless gc.auto=0" is wrong. To the extent that
>> there's races it won't matter that you invoke "git gc" or "git gc
>> --auto", it's the concurrency that matters. So if there's still races we
>> should be saying the repo needs to be locked for writes for the duration
>> of the "gc".
>
> Correct. It's the very act of pruning that is problematic. I think the
> point is that if you are manually running "git gc", you'd presumably do
> it at a time when the repository is not otherwise active.

Thanks for that e-mail. I'm hoping to get around to another set of "gc
doc" updates and will hopefully be able to steal liberally from it.

Maybe there's some case I haven't thought of that makes this stupid,
but I wonder if something like a "gc quarantine" might be a fix for
both of the issues you noted above. I.e. it seems to me that the main
issue is that we conflate "mtime 2 weeks old because it's been
unreferenced for 2 weeks" vs. "mtime 2 weeks old because we haven't
gotten around to a 'gc'".

So in such a "gc quarantine" mode, when we discover an object that's
unreachable (or a pack made up purely of unreachable objects), we'd
move the relevant loose object or "loose" pack to such a quarantine,
which would just be .git/unreferenced-objects/{??,pack}/ or whatever.

AFAICT both of the cases you mentioned above would be mitigated by
this, because we'd no longer conflate "haven't gc'd this yet and it's
2 weeks old" with "hasn't been referenced in 2 weeks".
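
In shell terms the quarantine step I have in mind would be something
like the sketch below. To be clear none of this exists today: the
quarantine path is made up, driving it via fsck is purely for
illustration, and I'm hand-waving away both the "transitive freshness"
rules and the question of how an object that becomes referenced again
gets pulled back out of the quarantine:

    # Step 1: instead of unlinking an unreachable loose object, park
    # it in the quarantine.
    quarantine=.git/unreferenced-objects

    git fsck --unreachable --no-reflogs |
    awk '/^unreachable/ { print $3 }' |
    while read -r oid
    do
        fanout=$(printf "%s" "$oid" | cut -c1-2)
        rest=$(printf "%s" "$oid" | cut -c3-)
        loose=.git/objects/$fanout/$rest

        # this sketch only handles loose objects, not packs
        test -f "$loose" || continue

        mkdir -p "$quarantine/$fanout" &&
        mv "$loose" "$quarantine/$fanout/$rest" &&
        # restart the clock: what we expire on below is time spent
        # known-unreferenced, i.e. time spent in the quarantine
        touch "$quarantine/$fanout/$rest"
    done

    # Step 2: a later gc deletes anything that has sat in the
    # quarantine (i.e. demonstrably unreferenced) for the whole grace
    # period.
    test -d "$quarantine" &&
    find "$quarantine" -type f -mtime +14 -delete

The point being that the mtime we'd check in step 2 is by construction
"time spent unreferenced", not "time since we happened to write the
object out".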
"hasn't been referenced in 2 weeks". I started looking at this initially because I was wondering if the --keep-unreachable mode you modified in e26a8c4721 ("repack: extend --keep-unreachable to loose objects", 2016-06-13) could be made to write out such "unreferenced" objects into their *own* pack, so we could delete them all at once as a batch, and wouldn't create the "ref explosions" mentioned in [1]. But of course without an accompanying quarantine described above doing that would just make this race condition worse. 1. https://public-inbox.org/git/87fu6bmr0j.fsf@xxxxxxxxxxxxxxxxxxx/