Re: [PATCH 0/4] gc docs: modernize and fix the documentation

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 31 Jul 2019 12:12:14 +0200

On Wed, Jul 31 2019, Jeff King wrote:

> On Fri, May 10, 2019 at 01:20:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Michael Haggerty and I have (off-list) discussed variations on that, but
>> > it opens up a lot of new issues.  Moving something into quarantine isn't
>> > atomic. So you've still corrupted the repo, but now it's recoverable by
>> > reaching into the quarantine. Who notices that the repo is corrupt, and
>> > how? When do we expire objects from quarantine?
>> >
>> > I think the heart of the issue is really the lack of atomicity in the
>> > operations. You need some way to mark "I am using this now" in a way
>> > that cannot race with "looks like nobody is using this, so I'll delete
>> > it".
>> >
>> > And ideally without traversing large bits of the graph on the writing
>> > side, and without requiring any stop-the-world locks during pruning.
>>
>> I was thinking (but realize now that I didn't articulate) that the "gc
>> quarantine" would be another "alternate" implementing a copy-on-write
>> "lockless delete-but-be-able-to-rollback scheme" as you put it.
>>
>> So "gc" would decide (racily) what's unreachable, but instead of
>> unlink()-ing it would "mv" the loose object/pack into the
>> "unreferenced-objects" quarantine.
>>
>> Then in your example #1 "wants to reference ABCD. It sees that we have
>> it." would race on the "other side". I.e. maybe ABCD was *just* moved to
>> the quarantine. But in that case we'd move it back, which would bump the
>> mtime and thus make it ineligible for expiry.
>
> I think this is basically the same as the current freshening scheme,
> though. In general, you can replace "move it back" with "update its
> mtime". Neither is atomic with respect to other operations.
>
> It does seem like the twist is that "gc" is supposed to do the "move it
> back" step (and it's also the thing expiring, if we assume that there's
> only one gc running at a time). But again, how do we know somebody isn't
> referencing it _right now_ while we're deciding whether to move it back?

The twist is to create a "quarantine" area of the object store that you
can't read any objects from without first copying them to the "main"
area (git-gc itself would be an exception).

Hence steps #2 and #6, respectively, in your examples in
https://public-inbox.org/git/20190319001829.GL29661@xxxxxxxxxxxxxxxxxxxxx/
would have update-ref/receive-pack fail to find "ABCD" in the "main"
store due to the exact same race we have now with mtimes & gc, then fall
back to the "quarantine" and (this is the important part) immediately
copy it back to the "main" store.
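
To make the fall-back concrete, here is a rough sketch in Python of
what I have in mind. Everything in it is hypothetical (the
QuarantineStore class, its method names, the directory layout, the
expiry threshold); none of it is existing git code, and it is not
race-free either, it only shows the read path "faulting" into the
quarantine and promoting what it finds. Note that rename() keeps the
old mtime, so the sketch bumps it explicitly:

```python
import os
import tempfile
import time


class QuarantineStore:
    """Toy object store: a "main" area plus a gc "quarantine" area."""

    def __init__(self, root):
        self.main = os.path.join(root, "main")
        self.quarantine = os.path.join(root, "quarantine")
        os.makedirs(self.main, exist_ok=True)
        os.makedirs(self.quarantine, exist_ok=True)

    def write(self, oid, data):
        with open(os.path.join(self.main, oid), "wb") as f:
            f.write(data)

    def gc_quarantine(self, oid):
        # Instead of unlink(), gc renames the "unreachable" object away.
        os.replace(os.path.join(self.main, oid),
                   os.path.join(self.quarantine, oid))

    def read(self, oid):
        main_path = os.path.join(self.main, oid)
        if not os.path.exists(main_path):
            # Fault back to the quarantine and immediately promote.
            try:
                os.replace(os.path.join(self.quarantine, oid), main_path)
            except FileNotFoundError:
                # Either another reader promoted it first, or it is gone.
                if not os.path.exists(main_path):
                    raise KeyError(oid)
            # rename() keeps the old mtime, so bump it explicitly.
            os.utime(main_path, None)
        with open(main_path, "rb") as f:
            return f.read()

    def gc_expire(self, max_age, now=None):
        # Unlink quarantined objects whose mtime is older than max_age
        # seconds; anything promoted in the meantime is no longer here.
        now = time.time() if now is None else now
        expired = []
        for oid in sorted(os.listdir(self.quarantine)):
            path = os.path.join(self.quarantine, oid)
            if now - os.stat(path).st_mtime > max_age:
                os.unlink(path)
                expired.append(oid)
        return expired


def demo():
    with tempfile.TemporaryDirectory() as root:
        store = QuarantineStore(root)
        store.write("abcd", b"blob")
        store.write("ef01", b"other")
        store.gc_quarantine("abcd")   # gc raced with a ref update
        store.gc_quarantine("ef01")   # genuinely unreachable
        data = store.read("abcd")     # faults back and promotes "abcd"
        # Pretend a day has passed; only "ef01" is still quarantined.
        expired = store.gc_expire(max_age=3600, now=time.time() + 86400)
        return data, expired, sorted(os.listdir(store.main))
```

In demo() the racily quarantined "abcd" is read back and ends up in
"main" again, while "ef01", which nobody asked for, is what the later
expiry pass unlinks.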

IOW yes, you'd have the exact same race you have now with the initial
move to the quarantine. You'd have ref updates & gc racing and
"unreachable" things would be moved to the quarantine, when really they
had just become reachable again.

The difference is that instead of unlinking that unreachable object we
move it to the quarantine, so the next "gc" (which is what would delete
it) would notice it's reachable and move it to the "main" area before
proceeding, *and* anything that "faults" back to reading the
"quarantine" would do the same.

> I think there are lots of solutions you can come up with if you have
> atomicity. But fundamentally it isn't there in the way we handle updates
> now. You could imagine something like a shared/unique lock where anybody
> updating a ref takes the "shared" side, and multiple entities can hold
> it at once. But somebody pruning takes the "unique" side and excludes
> everybody else, stopping ref updates during the prune (which you'd
> obviously want to do in a way that you hold the lock for as short as
> possible; say, optimistically check reachability without the lock, then
> take the lock and check to see if anything has changed).
>
> (By shared/unique I basically mean a reader/writer lock, but I didn't
> want to use those terms in the paragraph since both holders are
> writing).
>
> It is tricky to find out when to hold the shared lock, though. It's
> _not_ just a ref write, for example. When you accept a push, you'd want
> to hold the lock while you are checking that you have all of the
> necessary objects to write the ref. For something like "git commit" it's
> even harder, because we implicitly rely on state created by commands run
> over the course of hours or days (e.g., "git add" to put a blob in the
> index and maybe create the tree via cache-tree, then a commit to
> reference it, and finally the ref write; each step adds state which the
> next step relies on).

I don't think this sort of approach would require any global locks, but
it would be vulnerable to operations that take longer than the
"main->quarantine->unlink()" cycle takes. E.g. a "hash-object" that
takes a month before the subsequent "write-tree" etc.
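
As a toy illustration of that window (the object_fate function and
the day counts below are made up for the example, not proposed
defaults or anything gc does today):

```python
def object_fate(referenced_at, quarantined_at, unlinked_at):
    """Return what a late reference to an unreferenced object would find.

    All arguments are times in days since the object was written
    (e.g. by "hash-object"); referenced_at is when "write-tree" or
    similar first looks the object up.
    """
    if not quarantined_at <= unlinked_at:
        raise ValueError("gc must quarantine before it unlinks")
    if referenced_at < quarantined_at:
        return "found in main"
    if referenced_at < unlinked_at:
        return "promoted from quarantine"
    return "lost"
```

With a hypothetical cycle of quarantine at day 14 and unlink() at day
28, object_fate(20, 14, 28) is still recoverable via the quarantine,
but object_fate(30, 14, 28) is the month-long case where the object is
already gone.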

All of the above written with the previously stated "I may be missing
something" caveat etc. :)