On Fri, Jan 11, 2013 at 04:40:38PM +0530, Sitaram Chamarty wrote:

> I find a lot of info on how to recover from and/or repair a repo that
> has missing (or corrupted) objects.
>
> What I need is info on common reasons (other than disk errors -- we've
> checked for those) for such errors to occur, any preventive measures
> we can take, and so on.

I don't think any race can cause corruption of the loose object files or packfiles themselves, because of the way they are written. At GitHub, every case of file-level corruption we've seen has been a filesystem issue.

So I think the main systemic/race issue to worry about is missing objects. And since git only deletes objects during a prune (assuming you are using git-gc or "repack -A", so that repack cannot drop objects), I think prune is the only thing to watch out for.

The --expire time saves us from the obvious races where you write object X but have not yet referenced it, and a simultaneous prune wants to delete it. However, it's possible that you have an old object that is unreferenced, but would become referenced as a result of an in-progress operation. For example, commit X is unreferenced and ready to be pruned, you build commit Y on top of it, but before you write the ref, git-prune removes X.

The server-side version of that would happen via receive-pack, and is even more unlikely, because X would have to be referenced initially for us to advertise it. So it's something like:

  1. The repo has a ref R pointing at commit X.

  2. A user starts a push to another ref, Q, of commit Y that builds on X. Git advertises ref R, so the sender knows they do not need to send X, but only Y. The user then proceeds to send the packfile (which might take a very long time).

  3. Meanwhile, another user deletes ref R. X becomes unreferenced.

  4. After step 3 but before step 2 has finished, somebody runs prune (this might sound unlikely, but if you kick off a "gc" job after each push, or after N pushes, it's not so unlikely). It sees that X is unreferenced, and it may very well be older than the --expire setting. Prune deletes X.

  5. The packfile in (2) arrives, and receive-pack attempts to update the refs.

So it's even a bit more unlikely than the local case, because receive-pack would not otherwise build on dangling objects; you have to race steps (2) and (3) just to create the situation.

Then we have an extra protection in the form of check_everything_connected, which receive-pack runs before writing the refs into place. So if step 4 happens while the packfile is being sent (which is the most likely case, since that is the longest stretch of receive-pack's time), we would still catch it there and reject the push (annoying to the user, but the repo remains consistent).

However, that's not foolproof. We might hit step 4 after we've checked that everything is connected but right before we write the ref, in which case we drop X, which has just become referenced, and we have a missing object.

So I think it's possible. But I have never actually seen it in practice, and came up with this scenario only by brainstorming "what could go wrong" scenarios.
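
To make the local version of that race concrete (an unreferenced commit X, new work Y built on top of it, and a prune that removes X before the ref is written), here is a rough sketch; the object names, the test ref, and the shortened --expire window are hypothetical placeholders, not a recipe we actually run:

    # X is a dangling commit, older than the expire window used below
    git cat-file -t $X     # "commit" -- it exists, but nothing points to it

    # one process builds new work on top of X...
    Y=$(git commit-tree $TREE -p $X -m "new work")

    # ...but before it writes the ref, a concurrent prune removes X
    git prune --expire=1.hour.ago

    # the ref update then lands, pointing at a commit whose parent is gone
    git update-ref refs/heads/topic $Y
    git fsck               # reports a broken link / missing commit X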
This could be mitigated if there were a "proposed refs" storage. Receive-pack would write a note saying "consider Y reachable for pruning purposes, even though it's not really referenced yet", check connectivity for Y against the current refs, and then eventually write Y to its real ref (or reject it if there are problems). Prune would either run before the "proposed" note is written, in which case it deletes X but the connectivity check fails, or it would run after, in which case it leaves X alone.

> For example, can *any* type of network error or race condition cause
> this? (Say, can one push write an object, then fail an update
> check, and a later push succeed and race against a gc that removes
> the unreachable object?) Or... the repo is pretty large -- about 6-7
> GB, so could size cause a race that would not show up on a smaller
> repo?

The above is the only open issue I know about. I don't think it is dependent on repo size, but the window is widened for a really large push, because rev-list takes longer to run. It does not widen if you have receive.fsckobjects set, because that check happens before we do the connectivity check (and the connectivity check is run in a sub-process, so the race timer starts when we exec rev-list, which may open and mmap packfiles or otherwise cache the presence of X in memory).

> Anything else I can watch out for or caution the team about?

That's the only open issue I know about for missing objects. There is also a race with simultaneously deleting and packing refs. It doesn't cause object db corruption, but it will cause refs to "rewind" back to their packed versions. I have seen that one in practice (though it is relatively rare). I fixed it in b3f1280, which is not yet in any released version.

> The symptom is usually a disk space crunch caused by tmp_pack_* files
> left around by auto-gc. Presumably the missing objects caused the gc
> to fail and so it left the files around, and they eventually
> accumulate enough to cause disk full errors. (If a gc ever succeeded,
> I suspect these files would go away, but that requires manual
> intervention).

Gc shouldn't be leaving around tmp_pack_* files unless it is actually dying during the pack-objects phase. In my experience, stale tmp_pack_* files are more likely a sign of a failed push (e.g., the client hangs up in the middle, or fsck rejects the pack). We have historically left them in place to facilitate analysis of the failure. At GitHub, we've taken to just cleaning them up aggressively (I think after an hour), though I am tempted to put in an optional signal/atexit handler to clean them up when index-pack fails. The forensics are occasionally interesting (e.g., finding weird fsck problems), but large pushes can waste a lot of disk space in the interim.
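
As a rough illustration of that kind of aggressive cleanup, something along these lines would do; the repository path, the one-hour threshold, and the reliance on GNU find's -mmin are illustrative assumptions, not the exact job we run:

    # remove temporary packfiles that have been sitting around for over an
    # hour; tmp_pack_* files end up in the repository's objects/pack directory
    find /path/to/repo.git/objects/pack -maxdepth 1 -type f \
         -name 'tmp_pack_*' -mmin +60 -delete

-Peff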