Hi,

On Thu, 2011-03-31 at 10:14 -0400, David Hill wrote:
> These directories are all on the same mount ... with a total size of 1.2TB!
> /mnt/gfs is the mount
> /mnt/gfs/scripts/appl01
> /mnt/gfs/scripts/appl02
> /mnt/gfs/scripts/appl03
> /mnt/gfs/scripts/appl04
> /mnt/gfs/scripts/appl05
> /mnt/gfs/scripts/appl06
> /mnt/gfs/scripts/appl07
> /mnt/gfs/scripts/appl08
>
> All files accessed by the application are within its own folder/subdirectory.
> No file is ever accessed by more than one node.
>
> I'm going to suggest splitting, but this also brings up another issue:
>
> - We have a daily GFS lockout now... We need to reboot the whole cluster to solve the issue.
>
I'm not sure what you mean by that. What actually happens? Is it just the
filesystem that goes slow? Do you get any messages in /var/log/messages?
Do any nodes get fenced, or does that fail too?

Steve.

> This is going bad.
>
> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Alan Brown
> Sent: 31 March 2011 07:21
> To: linux clustering
> Subject: Re: GFS2 cluster node is running very slow
>
> David Hill wrote:
> > Hi Steve,
> >
> > We seem to be experiencing some new issues now... With 4 nodes, only one was slow, but with 3 nodes, 2 of them are now slow.
> > 2 nodes are doing 20kB/s and one is doing 2MB/s ... It seems like all nodes will end up with poor performance.
> > All nodes are locking files in their own directory: /mnt/application/tomcat-1, /mnt/application/tomcat-2 ...
>
> Just to clarify:
>
> Are these directories on the same filesystem or are they on individual
> filesystems?
>
> If the former, try splitting into separate filesystems.
>
> Remember that one node will become the filesystem master and everything
> else will be slower when accessing that filesystem.
>
> I'm out of ideas on this one.
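[Editor's note: splitting along the lines Alan suggests, so that each node only ever touches (and masters the locks of) its own filesystem, might look roughly like this in /etc/fstab. This is a hypothetical sketch; the volume names and the mycluster lock-table name are made up, not taken from the thread.]

```
/dev/vg_san/appl01_lv  /mnt/gfs/scripts/appl01  gfs2  defaults,noatime  0 0
/dev/vg_san/appl02_lv  /mnt/gfs/scripts/appl02  gfs2  defaults,noatime  0 0
# ... one filesystem per application, each created with its own
# lock table, e.g.: mkfs.gfs2 -p lock_dlm -t mycluster:appl01 -j 4 <dev>
```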
> >
> > Dave
> >
> >
> >
> > -----Original Message-----
> > From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of David Hill
> > Sent: 30 March 2011 11:42
> > To: linux clustering
> > Subject: Re: GFS2 cluster node is running very slow
> >
> > Hi Steve,
> >
> > I think you're right about the glock ... There were MANY more of these.
> > We're using a new server with totally different hardware. We've done many tests
> > before posting to the mailing list, like:
> > - copy files from the problematic node to the other nodes without using the problematic mount: everything is fine (7MB/s)
> > - read from the problematic mount on the "broken" node: fine too (21MB/s)
> > So, at this point, I doubt the problem is the network infrastructure behind the node (or the network adapter), because everything is going smoothly in every other respect, BUT
> > we cannot use the /mnt on the broken node because it's not usable. Last time I tried to copy a file to that /mnt it was doing 5kB/s while
> > all the other nodes were doing OK at 7MB/s ...
> >
> > Whenever we do the test, it doesn't seem to go higher than 200kB/s ...
> >
> > But still, we can transfer to all nodes at a decent speed from that host.
> > We can transfer to the SAN at a decent speed.
> >
> > CPU is 0% used.
> > Memory is 50% used.
> > Network is 0% used.
> >
> > The only difference between that host and the others is that the mysql database is hosted locally, with its storage on the same SAN ... but even with this,
> > mysqld is using only 2Mbit/s on the loopback, a little bit of memory and mostly NO CPU.
> >
> >
> > Here is a capture of the system:
> > top - 15:39:51 up 7:40, 1 user, load average: 0.08, 0.13, 0.11
> > Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie
> > Cpu0  : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu1  : 0.1%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu2  : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu3  : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu4  : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu5  : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu6  : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu7  : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu8  : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu9  : 0.1%us, 0.0%sy, 0.0%ni, 99.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu10 : 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu13 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu14 : 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu15 : 0.4%us, 0.1%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu16 : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu17 : 0.4%us, 0.1%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu18 : 0.2%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu19 : 0.6%us, 0.1%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu20 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu21 : 0.6%us, 0.1%sy, 0.0%ni, 99.2%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
> > Cpu22 : 0.2%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Cpu23 : 0.1%us, 0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
> > Mem:  32952896k total,  2453956k used, 30498940k free,   256648k buffers
> > Swap:  4095992k total,        0k used,  4095992k free,   684160k cached
> >
> >
> > It's a monster for what it does. Could it be that it's so much more performant than the other nodes that it kills itself?
> >
> > The server is CentOS 5.5.
> > The filesystem is 98% full (31G remaining on 1.2T) ... but if that is an issue, why are all the other nodes running smoothly and having no issues, but not that one?
> >
> >
> > Thank you for the reply,
> >
> > Dave
> >
> >
> >
> > -----Original Message-----
> > From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Steven Whitehouse
> > Sent: 30 March 2011 07:48
> > To: linux clustering
> > Subject: Re: GFS2 cluster node is running very slow
> >
> > Hi,
> >
> > On Wed, 2011-03-30 at 01:34 -0400, David Hill wrote:
> >> Hi guys,
> >>
> >> I've found this in /sys/kernel/debug/gfs2/fsname/glocks
> >>
> >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a
> >> [gfs2]
> >>
> >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a
> >> [gfs2]
> >>
> >> H: s:EX f:W e:0 p:806 [pdflush] gfs2_write_inode+0x57/0x152 [gfs2]
> >>
> > This doesn't mean anything without a bit more context. Were these all
> > queued against the same glock? If so, which glock was it?
> >
> >> The application running is Confluence and has 184 threads. The other
> >> nodes work fine, but that specific node is having issues obtaining
> >> locks when it's time to write?
> >>
> > That does sound a bit strange. Are you using a different network card on
> > the slow node? Have you checked to see if there is too much traffic on
> > that network link?
> >
> > Also, how full was the filesystem, and which version of GFS2 are you
> > using (i.e. RHEL x, Fedora x, CentOS, or ...)?
> >
> >
> > Steve.
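[Editor's note: Steve's question — whether those H: holder entries were queued against the same glock — can be answered by grouping holder lines under their G: line in the debugfs dump. A minimal Python sketch; the sample text stands in for /sys/kernel/debug/gfs2/<fsname>/glocks, and the glock numbers in it are invented, not from the thread.]

```python
# Group holder (H:) lines under their glock (G:) line and count how many
# holders are waiting (f: flags containing W), to see whether all the
# waiters are queued against a single glock.
import re

SAMPLE = """\
G:  s:EX n:2/805f f:yI t:EX d:EX/0 a:0 r:4
 H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
 H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
G:  s:SH n:2/19c7 f: t:SH d:EX/0 a:0 r:3
 H: s:EX f:W e:0 p:806 [pdflush] gfs2_write_inode+0x57/0x152 [gfs2]
"""

def waiters_per_glock(dump):
    """Return {glock line: number of holders whose f: flags include W}."""
    counts, current = {}, None
    for line in dump.splitlines():
        if line.startswith("G:"):
            current = line
            counts[current] = 0
        elif line.startswith(" H:") and current is not None:
            m = re.search(r"\bf:(\S*)", line)
            if m and "W" in m.group(1):
                counts[current] += 1
    return counts

if __name__ == "__main__":
    for glock, n in waiters_per_glock(SAMPLE).items():
        print(n, glock)
```

On a live node the same function could be fed the real dump, e.g. open("/sys/kernel/debug/gfs2/<clustername:fsname>/glocks").read(), with debugfs mounted.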
> >
> >>
> >> Dave
> >>
> >>
> >>
> >> From: linux-cluster-bounces@xxxxxxxxxx
> >> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of David Hill
> >> Sent: 29 March 2011 21:00
> >> To: linux-cluster@xxxxxxxxxx
> >> Subject: GFS2 cluster node is running very slow
> >>
> >> Hi guys,
> >>
> >> We have a GFS2 cluster consisting of 3 nodes. At this point,
> >> everything is going smoothly. Now, we have added a new node with more
> >> CPUs and the exact same configuration, but all transactions on the
> >> mount run very slowly.
> >>
> >> Copying a file to the mount is done at about 25kB/s, when on the three
> >> other nodes everything goes smoothly at about 7MB/s.
> >>
> >> CPU on all nodes is idling; all the cluster processes are kind
> >> of sleeping.
> >>
> >> We've tried the ping_pong.c from apache and it seems to be able to
> >> write/read lock files at a decent rate.
> >>
> >> There are other mounts on the system using the same FC
> >> card/fibres/switches/SAN, and all of these are also working at a decent
> >> speed...
> >>
> >> I've been reading a good part of the day, and I can't seem to find a
> >> solution.
> >>
> >>
> >> David C. Hill
> >>
> >> Linux System Administrator - Enterprise
> >>
> >> 514-490-2000 #5655
> >>
> >> http://www.ubi.com
> >>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster@xxxxxxxxxx
> >> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
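[Editor's note: ping_pong.c hammers a shared fcntl lock from several nodes at once to measure cluster lock rates. The same idea can be sketched as a single-node stand-in in Python — repeatedly take and drop a byte-range lock and report the rate. This is an illustration, not the original test; the function name, file path, and iteration count are arbitrary.]

```python
# Single-node sketch of a ping_pong-style lock-rate measurement:
# repeatedly acquire and release an fcntl byte-range lock on one file.
import fcntl
import os
import tempfile
import time

def lock_rate(path, iterations=10000):
    """Repeatedly take and drop an exclusive fcntl lock on byte 0 of
    `path`, returning the rough lock+unlock rate per second."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        start = time.monotonic()
        for _ in range(iterations):
            fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)  # lock byte 0
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0)  # release it
        elapsed = max(time.monotonic() - start, 1e-9)
        return iterations / elapsed
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        print("%.0f locks/s" % lock_rate(f.name))
```

Run against a file on the GFS2 mount (rather than a local temp file) from each node in turn, this would give a crude per-node comparison of posix-lock throughput, similar in spirit to what the thread used ping_pong for.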