I've seen the same behavior with slocate.cron doing its thing. I had to add gfs to its filesystem exclude list.

My setup is as follows:

3 HP DL380 G3s
1 MSA1000
6 FC2214 (QL2340) FC cards

The three nodes are set up as lock managers, and a 1TB filesystem was created. When populating the filesystem using scp, rsync, etc. from another machine with approximately 400GB worth of 50k files, the target machine would become unresponsive. This led me to move to the latest version available at the time (6.0.2.20-1) and to set up alternate NICs for lock_gulmd to use, which seems to have helped tremendously.

That said, after the first successful complete data transfer on this cluster, I did a 'du -sh' on the mount point and the machine got into a state where it would refuse to fork, which is exactly what the problem was with slocate.cron.

I've not upgraded the kernel yet (I am running 2.4.21-27.0.1.ELsmp). I'd like to stay with this version but will upgrade if required.

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rich Paredes
Sent: Wednesday, June 08, 2005 10:43 PM
To: mtilstra@xxxxxxxxxx; linux clustering
Subject: Re: Timeout causing GFS filesystem inaccessibility

I found out that updatedb was running on both nodes at 4:02 am, right before the problems were occurring. It was indexing the GFS filesystem, since gfs was not listed as an excluded filesystem. Could this explain the errors?

On 6/6/05, Michael Conrad Tadpol Tilstra <mtilstra@xxxxxxxxxx> wrote:
> On Fri, Jun 03, 2005 at 09:48:12PM -0400, Rich Paredes wrote:
> > Assumptions: 3 node cluster.
> > All 3 nodes are lock managers.
> > Nodes 1 and 2 mount GFS filesystems.
> > Node 1 during failure is master; nodes 2 and 3 are slaves.
> >
> > Error on node 2 is:
> > lock_gulmd_LT000[3608]: Timeout (15000000) on idx: 2 fd:7 (node1:192.168.101.11)
> >
> > This error keeps repeating in the logs and the GFS filesystems are totally inaccessible.
> > To fix, the master lock manager needs to be manually expired and then rebooted, because applications were accessing GFS filesystems.
> >
> > It looks like the error message is generated from lock_io.c.
> >
> > Does anyone know exactly what causes this error?
>
> New sockets have a specific time slot in which they must send a valid login packet before they are kicked out. The message you're seeing is from this. There should be a matching set of messages from node1 saying it is trying to log into node2. (The message might be suppressed, though. You will probably need to add LoginLoops to the verbosity setting.)
>
> That error message should provide some clues as to why the timeouts are happening.
>
> --
> Michael Conrad Tadpol Tilstra
> For some inexplicable reason, you just wish it would rain.
>
> --
> Linux-cluster@xxxxxxxxxx
> http://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
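For anyone else hitting the updatedb/slocate problem described at the top of the thread: on RHEL 3-era systems the exclusion list for the nightly slocate.cron run lives in /etc/updatedb.conf, and adding gfs to PRUNEFS keeps updatedb off GFS mounts. A sketch, assuming the stock slocate config layout (the surrounding entries are typical defaults; check your own file and variable names rather than copying these verbatim):

```shell
# /etc/updatedb.conf (slocate) -- add gfs to the filesystem exclude list
# so the nightly slocate.cron/updatedb run skips GFS mounts.
PRUNEFS="devpts NFS nfs afs proc smbfs autofs auto iso9660 gfs"
PRUNEPATHS="/tmp /usr/tmp /var/tmp /afs /net"
export PRUNEFS PRUNEPATHS
```

Since updatedb walks the entire tree, letting it loose on a large shared GFS filesystem from multiple nodes at once generates a storm of lock traffic through lock_gulmd, which fits the 4:02 am timing reported above.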