I've seen the same behavior with slocate.cron doing its thing. I had to add gfs to its filesystem exclude list.

My setup is as follows:

3 HP DL380 G3s
1 MSA1000
6 FC2214 (QL2340) FC cards

The three nodes are set up as lock managers, and a 1TB filesystem was created. When populating the filesystem using scp, rsync, etc. from another machine with approximately 400GB worth of 50k files, the target machine would become unresponsive. This led me to move to the latest version available at the time (6.0.2.20-1) and to set up alternate NICs for lock_gulmd to use, which seems to have helped tremendously.

That said, after the first successful complete data transfer on this cluster, I did a 'du -sh' on the mount point and the machine got into a state where it would refuse to fork, which is exactly what the problem was with slocate.cron.

I've not upgraded the kernel yet (I am running 2.4.21-27.0.1.ELsmp). I'd like to stay with this version but will upgrade if required.

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rich Paredes
Sent: Wednesday, June 08, 2005 10:43 PM
To: mtilstra@xxxxxxxxxx; linux clustering
Subject: Re: Timeout causing GFS filesystem inaccessibility

I found out that updatedb was running on both nodes at 4:02 am, right before the problems were occurring. It was indexing the GFS filesystem, since gfs was not listed as an excluded filesystem. Could this explain the errors?

On 6/6/05, Michael Conrad Tadpol Tilstra <mtilstra@xxxxxxxxxx> wrote:
> On Fri, Jun 03, 2005 at 09:48:12PM -0400, Rich Paredes wrote:
> > Assumptions: 3 node cluster.
> > All 3 nodes are lock managers.
> > Nodes 1 and 2 mount GFS filesystems.
> > Node 1 during failure is master; nodes 2 and 3 are slaves.
> >
> > Error on node 2 is:
> > lock_gulmd_LT000[3608]: Timeout (15000000) on idx: 2 fd:7 (node1:192.168.101.11)
> >
> > This error keeps repeating in the logs and the GFS filesystems are totally inaccessible.
> > To fix, the master lock manager needs to be manually expired and then rebooted, because applications were accessing GFS filesystems.
> >
> > It looks like the error message is generated from lock_io.c.
> >
> > Does anyone know exactly what causes this error?
>
> New sockets have a specific time slot in which they must send a valid login packet before they are kicked out. The message you're seeing is from this. There should be a matching set of messages from node1 saying it is trying to log into node2. (The message might be suppressed, though. You will probably need to add LoginLoops to the verbosity setting.)
>
> That error message should provide some clues as to why the timeouts are happening.
>
> --
> Michael Conrad Tadpol Tilstra
> For some inexplicable reason, you just wish it would rain.
>
> --
> Linux-cluster@xxxxxxxxxx
> http://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
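For anyone else hitting the updatedb/slocate problem described at the top of the thread: on RHEL 3-era systems the exclusion list for the nightly slocate.cron run lives in /etc/updatedb.conf, and adding gfs to PRUNEFS keeps updatedb off GFS mounts. A sketch, assuming the stock slocate config layout (the surrounding entries are typical defaults; check your own file and variable names rather than copying these verbatim):

```shell
# /etc/updatedb.conf (slocate) -- add gfs to the filesystem exclude list
# so the nightly slocate.cron/updatedb run skips GFS mounts.
PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660 gfs"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /net"
export PRUNEFS PRUNEPATHS
```

Since updatedb walks the entire tree, letting it loose on a large shared GFS filesystem from multiple nodes at once generates a storm of lock traffic through lock_gulmd, which fits the 4:02 am timing reported above.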