Hi,

On Tue, Oct 18, 2005 at 09:20:14AM -0500, Troy Dawson wrote:
> We've been having some problems with doing writes to our GFS file
> system, and it will pause, for long periods.  (Like from 5 to 10
> seconds, to 30 seconds, and occasionally 5 minutes)  After the pause, it's
> like nothing happened, whatever the process is, just keeps going happy
> as can be.
> Except for these pauses, our GFS is quite zippy, both reads and writes.
> But these pauses are holding us back from going full production.
> I need to know what tools I should use to figure out what is causing
> these pauses.
>
> Here is the setup.
> All machines: RHEL 4 update 1 (ok, actually S.L. 4.1), kernel
> 2.6.9-11.ELsmp, GFS 6.1.0, ccs 1.0.0, gulm 1.0.0, rgmanager 1.9.34
>
> I have no ability to do fencing yet, so I chose to use the gulm locking
> mechanism.  I have it setup so that there are 3 lock servers, for
> failover.  I have tested the failover, and it works quite well.

If this is a testing environment, use manual fencing. That is, when a
node needs to be fenced you get a log message telling you to fence it
by hand, and you then acknowledge that you have done so.

> I have 5 machines in the cluster.  1 isn't connected to the SAN, or
> using GFS.  It is just a failover gulm lock server in case the other two
> lock servers go down.
>
> So I have 4 machines connected to our SAN and using GFS.  3 are
> read-only, 1 is read-write.  If it is important, the 3 read-only are
> x86_64, the 1 read-write and the 1 not connected are i386.
>
> The read/write machine is our master lock server.  Then one of the
> read-only is a fallback lock server, as is the machine not using GFS.
>
> Anyway, we're getting these pauses when writing, and I'm having a hard
> time tracking down where the problem is.  I *think* that we can still
> read from the other machines.  But since this comes and goes, I haven't
> been able to verify that.

What SAN hardware is attached to the nodes?

> Anyway, which tools do you think would be best in diagnosing this?
I'd suggest checking and monitoring the network. Also put the cluster
communication on a network separate from the SAN/LAN network. The
cluster heartbeat goes over UDP, and a congested network may delay
these packets or drop them completely. At least that's the CMAN
picture; lock_gulm may be different.

Also, don't mix RHEL4 U1 with U2 or FC<N>, just in case you were
thinking of upgrading to SL4.2 one node at a time. There have been many
changes and bug fixes to the cluster bits in RHEL4 U2, and there are
also some new spiffy features like multipath. Perhaps it's worth
rebasing your testing environment?
-- 
Axel.Thimm at ATrpms.net
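
On the question of tools: since the pauses come and go, one low-tech
approach is to timestamp them so they can be lined up against network
and lock server logs afterwards. Below is a minimal sketch of such a
probe in Python, not a GFS tool. It appends a small block to a file,
fsyncs it, and logs any write that takes longer than a threshold. The
function names, the 4 KiB payload and the 2 second threshold are all
assumptions made for illustration.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data out so a stall shows up in the timing
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, count, interval=1.0, threshold=2.0, log=print):
    """Do `count` timed writes, `interval` seconds apart.

    Logs and returns (timestamp, seconds) for every write that took at
    least `threshold` seconds. The defaults are illustrative guesses.
    """
    slow = []
    for _ in range(count):
        elapsed = timed_write(path)
        if elapsed >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} write stalled for {elapsed:.1f}s")
            slow.append((stamp, elapsed))
        time.sleep(interval)
    return slow
```

Run on the read-write node with something like
`probe("/mnt/gfs/probe.dat", count=3600)`, where the path is a
hypothetical file on the GFS mount. The logged timestamps can then be
compared with syslog on the lock servers and with interface error
counters to see whether the stalls coincide with network trouble.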
-- Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster