On Sat, Jan 12, 2013 at 6:43 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Sat, Jan 12, 2013 at 06:39:52AM +0530, Sitaram Chamarty wrote:
>
>> >   1. The repo has a ref R pointing at commit X.
>> >
>> >   2. A user starts a push to another ref, Q, of commit Y that builds on
>> >      X. Git advertises ref R, so the sender knows they do not need to
>> >      send X, but only Y. The user then proceeds to send the packfile
>> >      (which might take a very long time).
>> >
>> >   3. Meanwhile, another user deletes ref R. X becomes unreferenced.
>>
>> The gitolite logs show that no deletion of refs has happened.
>
> To be pedantic, step 3 could also be rewinding R to a commit before X.
> Anything that causes X to become unreferenced.

Right, but there were no rewinds either; I should have mentioned that.

(Gitolite log files mark rewinds and deletes specially, so they're easy
to search.  There were two attempted rewinds, but they failed the
gitolite update hook, so -- while the new objects would have landed in
the object store -- the old ones never became unreferenced.)

>> > There is a race with simultaneously deleting and packing refs. It
>> > doesn't cause object db corruption, but it will cause refs to "rewind"
>> > back to their packed versions. I have seen that one in practice (though
>> > relatively rare). I fixed it in b3f1280, which is not yet in any
>> > released version.
>>
>> This is for the packed-refs file, right?  And it could result in a ref
>> getting deleted, right?
>
> Yes, if the ref was not previously packed, it could result in the ref
> being deleted entirely.
>
>> I said above that the gitolite logs say no ref was deleted.  What if
>> the ref "deletion" happened because of this race, making the rest of
>> your 4-step scenario above possible?
>
> It's possible. I do want to highlight how unlikely it is, though.

Agreed.

>> > up in the middle, or fsck rejects the pack). We have historically left
>>
>> fsck... you mean if I had 'receive.fsckObjects' true, right?  I don't.
>> Should I?  Would it help this overall situation?  As I understand it,
>> that's only about the internals of each object, to check for corruption;
>> it cannot detect a *missing* object in the local object store.
>
> Right, I meant if you have receive.fsckObjects on. It won't help this
> situation at all, as we already do a connectivity check separate from
> the fsck. But I do recommend it in general, just because it helps catch
> bad objects before they get disseminated to a wider audience (at which
> point it is often infeasible to rewind history). And it has found git
> bugs (e.g., null sha1s in tree entries).

I will add this.  Any idea if there's a significant performance hit?

>> > At GitHub, we've taken to just cleaning them up aggressively (I think
>> > after an hour), though I am tempted to put in an optional signal/atexit
>>
>> OK; I'll do the same then.  I suppose a cron job is the best way; I
>> didn't find any config for expiring these files.
>
> If you run "git prune --expire=1.hour.ago", it should prune stale
> tmp_pack_* files more than an hour old. But you may not be comfortable
> with such a short expiration for the objects themselves. :)
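For what it's worth, the cron entry I have in mind is roughly the
following -- just a sketch, with the repository base path made up and
the schedule picked arbitrarily:

    # hourly, as the git hosting user: delete temporary pack files left
    # behind by failed pushes once they are more than an hour old
    0 * * * *  find /home/git/repositories -name 'tmp_pack_*' -mmin +60 -delete

or simply "git prune --expire=1.hour.ago" per repo, as you suggest, once
I've decided whether that expiry is acceptable for loose objects too.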
>> Thanks again for your help.  I'm going to treat it (for now) as a
>> disk/fs error after hearing from you about the other possibility I
>> mentioned above, although I find it hard to believe one repo can be
>> hit by *two* races occurring together!
>
> Yeah, the race seems pretty unlikely (though it could be just the one
> race with a rewind). As I said, I haven't actually ever seen it in
> practice. In my experience, though, disk/fs issues do not manifest as
> just missing objects, but as corrupted packfiles (e.g., the packfile
> directory entry ends up pointing to the wrong inode, which is easy to
> see because the inode's content is actually a reflog). And then of
> course with the packfile unreadable, you have missing objects. But YMMV,
> depending on the fs and what's happened to the machine to cause the fs
> problem.

That's always the hard part.  The system admins (at the Unix level)
insist there's nothing wrong -- no disk errors and so on -- which is why
I was asking earlier whether network errors could cause problems like
this.

Anyway, now that I know the tmp_pack_* files are caused mostly by
failed pushes rather than by failed auto-gc, at least I can deal with
the immediate problem easily!

Thanks once again for your patient replies!

sitaram

--
Sitaram