On Wed, Jun 11, 2008 at 03:17:04PM +0000, Linus Torvalds wrote:
>
>
> On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> >
> > Could this be the source of a problem we often meet at work? Let me
> > try to describe it.
>
> The fsync() *should* make no difference unless you actually crash. So my
> initial reaction is no, but non-coherent client-side write caching over
> NFS may actually make a difference.

  That's what I thought as well but … one never knows ;)

> > We work with our git repositories (storages I should say) on NFS
> > homes, with workdirs on a local directory (NFS homes are backed up daily,
> > hence everything committed gets backed up, and developers have shorter
> > compilation times thanks to the local FS).
>
> Ok, so your actual git object directory is on NFS?

  Yes.

> > Quite often, when people commit, they have corrupt repositories. The
> > symptom is a `cannot read <sha1>` error message (or many at times). The
> > usual way to "fix" it is to git fsck, and git reset (because after the
> > fsck the index is totally screwed and all local files are marked new),
> > and usually everything is fine then.
>
> Hmm. Very interesting. That definitely sounds like a cache coherency
> issue (ie the "fsck" probably doesn't really _do_ anything, it just
> delays things and possibly causes memory pressure to throw some stuff out
> of the cache).
>
> What clients, what server?

  The server uses the NFSv3 kernel server from Debian's 2.6.18 etch kernel
(up to date). The clients are various Ubuntu/Debian systems with at least
2.6.18 kernels, some .22, .24 and .25. It's a really simple setup, no
clusters are involved. The server exports an ext3 over dm-crypt partition
though, but I would be surprised if it matters.

> That said, if there is some problem with that whole thing, then yes, the
> fsync() may well hide it. So yes, adding the fsync() is certainly worth
> testing.

  Okay, I'll try to make my colleagues use that to see if they still
have the issues. I work on a laptop and not on NFS, so I'm not the one
having the issues, only the one having to fix them on others' machines ;P

> > This is not really a hard corruption, and it's really hard to
> > reproduce, I don't know why it happens, and I wonder if this patch could
> > help, or if it's unrelated. I can only bring speculations as it's really
> > hard to reproduce, and it quite depends on the load of the NFS server :/
>
> Yes, that sounds very much like a cache coherency issue. The "corruption"
> goes away when the cache gets flushed and the clients see the real state
> again. But as mentioned, git should already do things in a way that this
> should all work, but hey, that's using certain assumptions that perhaps
> aren't true in your environment.

  Well, we have had the issue for quite a long time actually, and given
that it's hard to reproduce, I'm never in a position to give more useful
information :/

  We'll see if the fsync() helps or not…

-- 
·O·  Pierre Habouzit
··O  madcoder@xxxxxxxxxx
OOO  http://www.madism.org
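
For readers following the thread: the fsync() change under discussion amounts to forcing a newly written object file out to stable storage (or, over NFS, to the server) before it is moved to its final name. Below is a minimal sketch of that pattern in Python. It is not git's actual C code: the function name `write_object_durably`, the `tmp_obj_` prefix and the simplified naming (SHA-1 of the raw payload, without git's object header) are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import zlib


def write_object_durably(objects_dir: str, payload: bytes) -> str:
    """Store payload under a content-derived name, fsync()ing before the rename.

    Hypothetical helper for illustration only; git's real object format
    and write path differ.
    """
    name = hashlib.sha1(payload).hexdigest()
    final_dir = os.path.join(objects_dir, name[:2])
    final_path = os.path.join(final_dir, name[2:])
    os.makedirs(final_dir, exist_ok=True)

    if os.path.exists(final_path):
        # Content-addressed storage: same name implies same content.
        return final_path

    # Create the temporary file in the destination directory so the
    # rename below never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, prefix="tmp_obj_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(zlib.compress(payload))
            f.flush()              # drain Python's userspace buffer
            os.fsync(f.fileno())   # ask the OS to push the data to storage
        os.rename(tmp_path, final_path)
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Without the `os.fsync()` call the data may still sit only in the client's cache when the file appears under its final name, which is the window a non-coherent NFS client cache could expose. Whether that explains the symptoms described above is exactly what remains to be tested.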