Sorry about the duplicate message. I originally sent this with a mistake
in my email address. When I fixed it, this message apparently went
through to the list.
jonathan
Jonathan Woytek wrote:
Hello. I've tried to read up on the lists here to see what I can find
about these sorts of issues, but the information appears to be somewhat
sparse.
Here's my situation: I have a two-member cluster built on RHEL 3 AS
(with all current updates installed). That means kernel
2.4.21-27.0.2.EL with GFS (6.0.2-25) and cluster services (1.2.16-1)
built from SRPMS distributed by Red Hat. My storage is iSCSI-based over
gigabit ethernet. The hardware is Dell PowerEdge 1860's with 4GB of RAM
and dual 2.4GHz processors.
My problem is that the node serving disk via NFS and Samba gets into a
strange mode where it starts to get kernel-based out-of-memory errors,
which start to kill things off. The machine reboots itself and comes
back up with no issues. In the process, of course, it wreaks havoc with
lock_gulmd and a host of other things, and makes a bunch of users upset
(it probably didn't help that we've been dealing with unstable storage
here for a while, and I put this system together with the idea that it
would be more reliable).
I plan on trying to add a third node, which would fix the lock_gulmd
craziness. That's not my big problem, though. I NEED to figure out why
this is happening. My analysis so far seems to indicate that the
crashes happen mostly when a lot of files are open (or at
least when there is a lot of disk activity). The failures seem to occur most often
when people are accessing data (on GFS) from the server over an NFS
mount to another machine, but they also seem to occur if the machine has
seen a day's worth of that sort of usage and the backup system tries to
get its nightly backup between 11PM and 2AM. When memory starts to get
low, kswapd shows up and starts eating serious cycles, along with the
nfsd's. I've tried increasing the number of nfsd's, but that didn't
seem to have an effect.
Any ideas on things I should be checking? Interestingly enough, no swap
seems to be used when this happens. The load average normally creeps up
right before death, and the machine gets down to less than 18MB free
(though a lot of the 4GB is tied up in cache).
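
For anyone wanting to record the figures mentioned above (free memory, cache,
swap use) in the run-up to a failure, here is a minimal sketch that parses
the text of /proc/meminfo. It is an illustration only, not part of the setup
described here. Assumptions: Python 3 is available somewhere to run it (it
can be fed captured output from the server), the function names are made up
for this example, and the LowFree field is only reported by kernels built
with high-memory support.

```python
import re

# Matches lines such as "MemFree:   17408 kB". Anything else, including the
# byte-count summary table that 2.4 kernels print first, is skipped.
_FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB$")


def parse_meminfo(text):
    """Return {field: kB} for the 'Name: value kB' lines in meminfo text."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def summarize(fields):
    """Pick out free, cached, low-memory and swap figures in MB.

    A value is None when the kernel did not report the field
    (LowFree is an assumption: it needs a highmem-enabled kernel).
    """
    def mb(name):
        value = fields.get(name)
        return None if value is None else value / 1024

    swap_total = fields.get("SwapTotal")
    swap_free = fields.get("SwapFree")
    swap_used = None
    if swap_total is not None and swap_free is not None:
        swap_used = (swap_total - swap_free) / 1024

    return {
        "free_mb": mb("MemFree"),
        "cached_mb": mb("Cached"),
        "low_free_mb": mb("LowFree"),
        "swap_used_mb": swap_used,
    }
```

On the server itself this could be called as
summarize(parse_meminfo(open("/proc/meminfo").read())) from a loop or cron
job, with the results appended to a log that survives the reboot.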
jonathan
--
Jonathan Woytek w: 412-681-3463 woytek+@xxxxxxx
NREC Computing Manager c: 412-401-1627 KB3HOZ
PGP Key available upon request
--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster