On Sat, Jan 12, 2013 at 6:43 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Sat, Jan 12, 2013 at 06:39:52AM +0530, Sitaram Chamarty wrote:
>
>> >   1. The repo has a ref R pointing at commit X.
>> >
>> >   2. A user starts a push to another ref, Q, of commit Y that builds on
>> >      X. Git advertises ref R, so the sender knows they do not need to
>> >      send X, but only Y. The user then proceeds to send the packfile
>> >      (which might take a very long time).
>> >
>> >   3. Meanwhile, another user deletes ref R. X becomes unreferenced.
>>
>> The gitolite logs show that no deletion of refs has happened.
>
> To be pedantic, step 3 could also be rewinding R to a commit before X.
> Anything that causes X to become unreferenced.

Right, but there were no rewinds either; I should have mentioned that.

(Gitolite log files mark rewinds and deletes specially, so they're easy
to search.  There were two attempted rewinds, but they failed the
gitolite update hook, so -- while the new objects would have landed in
the object store -- the old ones never became unreferenced.)

>> > There is a race with simultaneously deleting and packing refs. It
>> > doesn't cause object db corruption, but it will cause refs to "rewind"
>> > back to their packed versions. I have seen that one in practice (though
>> > relatively rare). I fixed it in b3f1280, which is not yet in any
>> > released version.
>>
>> This is for the packed-refs file, right?  And it could result in a ref
>> getting deleted, right?
>
> Yes, if the ref was not previously packed, it could result in the ref
> being deleted entirely.
>
>> I said above that the gitolite logs say no ref was deleted.  What if
>> the ref "deletion" happened because of this race, making the rest of
>> your 4-step scenario above possible?
>
> It's possible. I do want to highlight how unlikely it is, though.

Agreed.

>> > up in the middle, or fsck rejects the pack). We have historically left
>>
>> fsck... you mean if I had 'receive.fsckObjects' true, right?  I don't.
>> Should I?  Would it help this overall situation?  As I understand it,
>> that's only about the internals of each object, to check for corruption;
>> it cannot detect a *missing* object in the local object store.
>
> Right, I meant if you have receive.fsckObjects on. It won't help this
> situation at all, as we already do a connectivity check separate from
> the fsck. But I do recommend it in general, just because it helps catch
> bad objects before they get disseminated to a wider audience (at which
> point it is often infeasible to rewind history). And it has found git
> bugs (e.g., null sha1s in tree entries).

I will add this.  Any idea if there's a significant performance hit?

>> > At GitHub, we've taken to just cleaning them up aggressively (I think
>> > after an hour), though I am tempted to put in an optional signal/atexit
>>
>> OK; I'll do the same then.  I suppose a cron job is the best way; I
>> didn't find any config for expiring these files.
>
> If you run "git prune --expire=1.hour.ago", it should prune stale
> tmp_pack_* files more than an hour old. But you may not be comfortable
> with such a short expiration for the objects themselves. :)
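For what it's worth, the cron entry I have in mind is roughly the
following -- just a sketch, with the repository base path made up and
the schedule picked arbitrarily:

    # hourly, as the git hosting user: delete temporary pack files left
    # behind by failed pushes once they are more than an hour old
    0 * * * *  find /home/git/repositories -name 'tmp_pack_*' -mmin +60 -delete

or simply "git prune --expire=1.hour.ago" per repo, as you suggest, once
I've decided whether that expiry is acceptable for loose objects too.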
>> Thanks again for your help.  I'm going to treat it (for now) as a
>> disk/fs error after hearing from you about the other possibility I
>> mentioned above, although I find it hard to believe one repo can be
>> hit by *two* races occurring together!
>
> Yeah, the race seems pretty unlikely (though it could be just the one
> race with a rewind). As I said, I haven't actually ever seen it in
> practice. In my experience, though, disk/fs issues do not manifest as
> just missing objects, but as corrupted packfiles (e.g., the packfile
> directory entry ends up pointing to the wrong inode, which is easy to
> see because the inode's content is actually a reflog). And then of
> course with the packfile unreadable, you have missing objects. But YMMV,
> depending on the fs and what's happened to the machine to cause the fs
> problem.

That's always the hard part.  The system admins (at the Unix level)
insist there's nothing wrong -- no disk errors and so on -- which is why
I was asking earlier whether network errors could cause problems like
this.

Anyway, now that I know the tmp_pack_* files are caused mostly by
failed pushes rather than by failed auto-gc, at least I can deal with
the immediate problem easily!

Thanks once again for your patient replies!

sitaram

--
Sitaram