Re: Consolidate SHA1 object file close

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 11 Jun 2008 08:17:04 -0700 (PDT)

On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> 
>   Could this be the source of a problem we often meet at work ? Let me
> try to describe it.

The fsync() *should* make no difference unless you actually crash. So my 
initial reaction is no, but non-coherent client-side write caching over 
NFS may actually make a difference.

>   We work with our git repositories (storages I should say) on NFS
> homes, with workdirs on a local directory (NFS homes are backed up daily,
> hence everything committed gets backed up, and developers have shorter
> compilation times thanks to the local FS).

Ok, so your actual git object directory is on NFS?

>   Quite often, when people commit, they have corrupt repositories. The
> symptom is a `cannot read <sha1>` error message (or many at times). The
> usual way to "fix" it is to git fsck, and git reset (because after the
> fsck the index is totally screwed and all local files are marked new),
> and usually everything is fine then.

Hmm. Very interesting. That definitely sounds like a cache coherency 
issue (ie the "fsck" probably doesn't really _do_ anything, it just 
delays things and possibly causes memory pressure to throw some stuff out 
of the cache).

What clients, what server?

NFS clients (I assume v2, which is not coherent) _should_ be doing what is 
called open-close consistency, which means that while clients can cache 
data locally, they should aim to be consistent between two clients over 
an open-close pair (ie if two clients have the same file open at the same 
time, there are no consistency guarantees, but if you close on one client 
and then open on another, the data should be consistent).

If open-close consistency doesn't work, then various parallel load 
distribution setups (clusters with an NFS filesystem doing parallel 
makes, etc) don't tend to work all that well either (ie an object file is 
written on one client, and then used for linking on another).

And that is what git does: even without the fsync(), git will "close()" 
the file before it actually does the link + unlink to move it to the new 
position. So it all _should_ be perfectly consistent even in the absence 
of explicit syncs.
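
To make that ordering concrete, here is a rough sketch of the sequence: 
write to a temporary name, optionally fsync(), close(), and only then 
link + unlink into place. This is not git's actual code. The function 
name write_object_file(), its parameters, and the choice to treat EEXIST 
from link() as success (on the assumption that an object with the same 
name has the same contents) are all made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not git's real implementation): write `len` bytes
 * from `buf` to `tmp_path`, optionally fsync(), close(), and then move the
 * file to `final_path` with link() + unlink().
 * Returns 0 on success, -1 on failure.
 */
static int write_object_file(const char *tmp_path, const char *final_path,
                             const void *buf, size_t len, int do_fsync)
{
    const char *p = buf;
    int ret = 0;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0444);

    if (fd < 0)
        return -1;

    /* Write everything, coping with short writes and EINTR. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Optional: only matters for crashes, or for hiding cache problems. */
    if (do_fsync && fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    /* The close() happens *before* the file becomes visible under its
     * final name, which is what open-close consistency relies on. */
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Assumption: an existing target already holds the same contents. */
    if (link(tmp_path, final_path) < 0 && errno != EEXIST)
        ret = -1;
    unlink(tmp_path);
    return ret;
}
```

Calling it with do_fsync set to 0 gives the plain close-then-link case, 
and with 1 the variant that is worth testing on the NFS setup.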

That said, if there is some problem with that whole thing, then yes, the 
fsync() may well hide it. So yes, adding the fsync() is certainly worth 
testing.

>   This is not really a hard corruption, and it's really hard to
> reproduce, I don't know why it happens, and I wonder if this patch could
> help, or if it's unrelated. I can only bring speculations as it's really
> hard to reproduce, and it quite depends on the load of the NFS server :/

Yes, that sounds very much like a cache coherency issue. The "corruption" 
goes away when the cache gets flushed and the clients see the real state 
again. But as mentioned, git should already do things in a way that this 
should all work, but hey, that's using certain assumptions that perhaps 
aren't true in your environment.

			Linus