Re: Consolidate SHA1 object file close

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 12 Jun 2008 08:33:53 -0700 (PDT)

On Thu, 12 Jun 2008, Pierre Habouzit wrote:
> 
>   No, we're not using a shared git object repository, each developper
> has a git checkout in his /home (on NFS) but works for real in a workdir
> that lives on his local hard drive (to get faster compilation times,
> because NFS really sucks at speed for compilation). Though, people
> working on plain NFS have had the same problems.

Ahhh..

In that case it's not going to be a client caching issue - at least not in 
the sense that two different clients are out-of-sync with each other wrt 
caches. It sounds as if you only ever have one client that reads and 
writes to the same git repository at a time.

So scratch all the previous theory.

Quite frankly, in that case, it sounds more like simply some NFS problem. 
And we _have_ had NFS problems before. See the threads

 - bug: git-repack -a -d produces broken pack on NFS

   Turned out to apparently be ethernet packet corruption that was not 
   detected by the hardware and was due to a badly seated ethernet card!

 - git 1.5.3.5 error over NFS

   Some unexplained corruption due to problems with pread() on NFS not 
   returning data that was previously written.

for example.

Basically, NFS has many serious failure cases that can go undetected, and 
it _could_ be that you actually have flaky NFS but never noticed it before 
because most tools don't care as deeply as git does (ie if a bit is 
flipped in some random data, a lot of tools will never notice). There are 
supposed to be checksums etc on the network packets that NFS uses, but:

 - the ethernet checksum (which is a fairly strong CRC) is sadly often not 
   even checked by some switches and/or cards, and especially if it's a 
   store-and-forward switch that doesn't check the CRC properly, it can 
   end up re-sending a corrupt packet with a recomputed ethernet CRC that 
   now matches the _corrupt_ data. Oops.

 - Perhaps worse, the ethernet checksum is purely a link-level (per-hop) 
   one, not an end-to-end checksum, which not only explains how a switch can 
   re-generate a broken one, but also means that even if the ethernet card 
   checks it properly, it doesn't actually account for any corruption that 
   happens _afterwards_. So if there is corruption going from the card to 
   memory (which was apparently the problem in the first git thread 
   above), the CRC got checked earlier and the new corruption isn't found.

 - there _is_ a TCP/IP-level packet check, with a checksum of the IP 
   header, and a separate checksum of UDP and TCP data. HOWEVER. All these 
   checksums are very very weak (plain 16-bit ones'-complement sums), and 
   to make things worse, the UDP checksum can be entirely disabled, and 
   quite often "better" ethernet cards will do checksumming for you in 
   hardware, which again means that it's not an end-to-end checksum, and 
   you have the exact same failure case as with the ethernet CRC.

IOW, there are safety nets in place, but they tend to be fairly easily 
broken under certain circumstances.
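
To make "very weak" concrete, here is a minimal sketch (not taken from any 
of the threads above) of the 16-bit ones'-complement sum used for the 
TCP/UDP/IP checksums, as described in RFC 1071. The payload and the helper 
names are made up for illustration. Because the sum is order-independent, 
exchanging two 16-bit words leaves the checksum unchanged, while a CRC-32 
(zlib.crc32 uses the same polynomial as the ethernet FCS) and the SHA-1 
that git relies on both change:

```python
import hashlib
import struct
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def swap_words(data: bytes, i: int, j: int) -> bytes:
    """Return a copy of data with 16-bit words i and j exchanged."""
    buf = bytearray(data)
    buf[2 * i:2 * i + 2] = data[2 * j:2 * j + 2]
    buf[2 * j:2 * j + 2] = data[2 * i:2 * i + 2]
    return bytes(buf)


# Hypothetical payload; any even-length buffer with two differing words works.
original = b"git object payload, 16-bit words"
corrupted = swap_words(original, 0, 5)

same_internet = internet_checksum(original) == internet_checksum(corrupted)
same_crc32 = zlib.crc32(original) == zlib.crc32(corrupted)
same_sha1 = hashlib.sha1(original).digest() == hashlib.sha1(corrupted).digest()

print("data differs:           ", original != corrupted)
print("internet checksum equal:", same_internet)
print("crc32 equal:            ", same_crc32)
print("sha1 equal:             ", same_sha1)
```

This only shows one class of error the TCP/UDP checksum cannot see. It says 
nothing about how often such errors happen on a given network.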

Add to the above the possibility of just a kernel NFS bug (or an NFSd one), 
and it would really be very interesting to hear:

 - do the errors seem to happen more at certain clients than others?

   If it's a client-side problem, it really should happen more for certain 
   kernel versions or certain hardware.

 - have you had any other anecdotal evidence of problems with non-git 
   usage? Unexplained SIGSEGV's if you have binaries over NFS, for 
   example? Strange syntax errors when compiling over NFS?

I'm not discounting a git bug, but quite frankly, it really is worth 
checking that your network/NFS setup is solid.
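
One rough way to start such a check, sketched here under assumptions of my 
own: write random data into a directory on the NFS mount, read it back, and 
compare SHA-1 digests. The function name, file prefix, sizes and round 
count are all hypothetical. Note that a read-back on the same client may 
well be served from the local page cache, so a clean run proves little; 
reading the files from a second client, or after the caches have been 
dropped, is more likely to exercise the network path.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory: str, size: int = 1 << 20, rounds: int = 8) -> int:
    """Write random files into directory, read them back, count mismatches."""
    mismatches = 0
    for _ in range(rounds):
        payload = os.urandom(size)
        expected = hashlib.sha1(payload).hexdigest()
        fd, path = tempfile.mkstemp(dir=directory, prefix="nfs-check-")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # push the data out to the server
            with open(path, "rb") as f:
                actual = hashlib.sha1(f.read()).hexdigest()
            if actual != expected:
                mismatches += 1
        finally:
            os.unlink(path)  # always clean up the scratch file
    return mismatches


if __name__ == "__main__":
    import sys

    # The target directory is supplied by the caller; default to the cwd.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("mismatches:", write_and_verify(target))
```

A nonzero count would point at the storage or network path rather than at 
git, but a zero count does not rule either of them out.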

			Linus