Re: Re: write's pausing - which tools to debug?

Axel Thimm wrote:
Hi,

On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:

We've been having problems with writes to our GFS file system: they pause for long periods (anywhere from 5-10 seconds up to 30 seconds, and occasionally 5 minutes). After the pause, it's as if nothing happened; whatever the process is just keeps going, happy as can be. Except for these pauses, our GFS is quite zippy on both reads and writes, but the pauses are holding us back from going into full production. I need to know which tools I should use to figure out what is causing them.

Here is the setup.
All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34

I have no ability to do fencing yet, so I chose to use the gulm locking mechanism. I have set it up with 3 lock servers for failover. I have tested the failover, and it works quite well.


If this is a testing environment, use manual fencing. I.e., when a node
needs to be fenced, you get a log message telling you to fence it by
hand and then acknowledge that you have done so.
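For reference, manual fencing in that generation of the cluster suite is configured through the fence_manual agent in /etc/cluster/cluster.conf. The fragment below is only an illustration; the node and device names are made up:

```xml
<!-- Illustrative cluster.conf fragment; names are placeholders. -->
<clusternode name="node1">
  <fence>
    <method name="single">
      <device name="human" nodename="node1"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice name="human" agent="fence_manual"/>
</fencedevices>
```

When a node fails, the fencing code logs that manual intervention is required; after you have made sure the node is really down (e.g. power-cycled it), you acknowledge with fence_ack_manual.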


I have 5 machines in the cluster. 1 isn't connected to the SAN or using GFS; it is just a failover gulm lock server in case the other two lock servers go down.

So I have 4 machines connected to our SAN and using GFS: 3 are read-only, 1 is read-write. If it matters, the 3 read-only machines are x86_64; the read-write machine and the unconnected one are i386.

The read/write machine is our master lock server. One of the read-only machines is a fallback lock server, as is the machine not using GFS.

Anyway, we're getting these pauses when writing, and I'm having a hard time tracking down where the problem is. I *think* we can still read from the other machines, but since this comes and goes, I haven't been able to verify that.
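One way to make the pauses measurable, and to check whether reads on the other nodes stall at the same moment, is a small write probe in the shell. This is only a sketch; the mount point and iteration count are placeholders, not anything from this thread:

```shell
# Sketch of a write-latency probe. Each iteration does one small
# fsync'd write and prints how long it took, so a pause shows up as
# a single long iteration with a visible gap.
probe_writes() {
    dir=$1; n=$2; i=0
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s)
        dd if=/dev/zero of="$dir/pause-probe.$$" bs=4k count=1 conv=fsync 2>/dev/null
        echo "write $i took $(( $(date +%s) - start ))s"
        i=$((i + 1))
    done
    rm -f "$dir/pause-probe.$$"
}

# e.g. probe_writes /mnt/gfs 100  (mount point is an assumption);
# running a read loop on the read-only nodes at the same time would
# show whether reads stall during the write pauses.
```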


What SAN hardware is attached to the nodes?


From the switch on down, I don't know. It's a centrally managed SAN that I have been allowed to plug into and been given disk space on. I do have QLogic cards in the machines.

Anyway, which tools do you think would be best in diagnosing this?


I'd suggest checking/monitoring the network. Also, place the cluster
communication on a separate network from the SAN/LAN network. The
cluster heartbeat goes over UDP, and a congested network may delay
these packets or drop them completely. At least that's the CMAN
picture; lock_gulm may be different.
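For the check/monitor part, one low-tech approach (a sketch, not something from this thread) is to watch the kernel's UDP counters while a pause is in progress; a jump in errors or drops that lines up with a pause points at the network:

```shell
# Sketch: print the UDP counter values from /proc/net/snmp once per
# second. Compare the InDatagrams/InErrors columns before and after
# a write pause. The interval and count are arbitrary choices.
watch_udp() {
    n=$1; i=0
    while [ "$i" -lt "$n" ]; do
        # the second "Udp:" line holds the values (the first is the header)
        awk '/^Udp:/ { line = $0 } END { print line }' /proc/net/snmp
        sleep 1
        i=$((i + 1))
    done
}

# e.g. watch_udp 300 > udp.log   while reproducing a pause
```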


That sounds like a good idea. All of our machines have two ethernet ports, and I'm not using the second one on any of them. That would actually fix some other problems as well.
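Since the cluster daemons talk to whatever addresses the configured node names resolve to, one common way to move that traffic onto the second port (an assumption about the setup, not something stated in this thread) is to give each second interface a private address and resolve the cluster node names to those addresses, e.g. in /etc/hosts:

```
# Illustrative /etc/hosts entries; addresses and names are made up.
# Resolve the node names used in cluster.conf to the dedicated
# cluster-communication interfaces.
192.168.10.1  node1
192.168.10.2  node2
```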

Also, don't mix RHEL4 U1 and U2 or FC<N> nodes, just in case you'd
like to upgrade to SL4.2 one node at a time.


Yup, read that, but thanks for the reminder.

There have been many changes/bug fixes to the cluster bits in RHEL4 U2,
and there are also some spiffy new features like multipath. Perhaps
it's worth rebasing your testing environment?


Don't I wish it was a testing environment. But at least the machines don't HAVE to be 24x7, and I've only got one of them in production right now, so it would be only the one going down.

Troy
--
__________________________________________________
Troy Dawson  dawson@xxxxxxxx  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
