Re: [Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS

Even more additional information:

I've been monitoring the system through a few crashes now, and it looks like what is actually being exhausted is "lowmem". The system seems to eat about 130-140 kB of low memory every two seconds, but it is NOT actually plowing through 3GB+ of memory; highmem does not seem to drop.
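In case anyone wants to watch for the same thing, something like the following will show the per-interval drop (a rough illustrative sketch in Python, not the exact script I'm running; it assumes a highmem kernel that reports LowFree/HighFree in /proc/meminfo):

#!/usr/bin/env python
# Rough sketch: sample LowFree/HighFree from /proc/meminfo every two
# seconds and print the per-interval change, to confirm the roughly
# 130-140 kB per two seconds that lowmem is dropping.
import time

def meminfo_kb(field):
    # Value (in kB) of a /proc/meminfo line such as "LowFree: 123456 kB".
    for line in open("/proc/meminfo"):
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return None

prev = meminfo_kb("LowFree")
while True:
    time.sleep(2)
    low = meminfo_kb("LowFree")
    high = meminfo_kb("HighFree")
    print("LowFree: %d kB (%+d kB)  HighFree: %d kB" % (low, low - prev, high))
    prev = low

Watching that across one of the runaway periods should make it obvious whether the leak rate changes when NFS traffic picks up.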

Whee fun.

jonathan


Jonathan Woytek wrote:

Additional information:

I enabled full output on lock_gulmd, since my dead top sessions would often show that process near the top of the list around the time of the crashes. The machine was rebooted around 10:50AM and was down again at 12:44. In the span of less than a minute, the machine plowed through over 3GB of memory and crashed. The extra debugging information from lock_gulmd said nothing except that there was a successful heartbeat. The OOM messages began at 12:44:01, and the machine was dead somewhere around 12:44:40. Nobody should have been using the machine during this time.

A cron job that was scheduled to fire off at 12:44 (it runs every two minutes to check memory usage, specifically to try to track this problem) did not run (or at least was not logged if it did). I took that job out of cron just to make sure that it isn't part of the problem. The low-memory check that ran at 12:42 reported nothing, and my threshold for that is set at 512MB.
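For reference, a check of that sort can be as small as the following sketch (illustrative only, not the actual cron job; it tests MemFree from /proc/meminfo against the 512MB threshold and logs a warning to syslog):

#!/usr/bin/env python
# Illustrative sketch of an every-two-minutes cron memory check:
# warn via syslog when MemFree drops below a 512 MB threshold.
import syslog

THRESHOLD_KB = 512 * 1024  # 512 MB, the threshold mentioned above

free_kb = None
for line in open("/proc/meminfo"):
    if line.startswith("MemFree:"):
        free_kb = int(line.split()[1])
        break

if free_kb is not None and free_kb < THRESHOLD_KB:
    syslog.syslog(syslog.LOG_WARNING, "low memory: MemFree=%d kB" % free_kb)

Given the newer observation above that it is lowmem specifically that runs out, a check keyed only on MemFree can stay silent even while LowFree is collapsing.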

The span between crashes this weekend has been between three and eight hours. Yesterday the machine rebooted (going by lastlog, not the last message before restart in /var/log/messages, but I'll be looking at that in a bit) at 15:20 (after being up since 23:50 on Friday), 18:27, and 21:43, then on Sunday at 01:14, 04:33, and finally 12:48. Something seems quite wrong with this.

jonathan


Jonathan Woytek wrote:

I have been experiencing OOM failures (followed by reboots) on a cluster of Dell PowerEdge 1860s (dual-proc, 4GB RAM) running RHEL3-AS with all current updates.

The system is configured as a two-member cluster, running GFS 6.0.2-25 (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM). My original testing of the cluster went fine, including service fail-over and all that stuff (there is only one lock_gulmd server, so if the master goes down the world explodes, but I expected that).

Use seemed to be okay, but there weren't a whole lot of users. Recently, a project wanted to serve some data from their space in GFS via their own machine. Their machine mounts that space from the cluster via NFS, and they serve the data via Samba from their machine. Shortly thereafter, two things happened: more people started to access the data, and the cluster machines started to crash.

The symptoms are that free memory drops extremely quickly (sometimes more than 3GB disappears in less than two minutes). The load average usually climbs quickly (when I can see it), and NFS processes are normally at the top of top, along with kswapd. At some point around this time, the kernel starts to spit out OOM messages and kills off batches of processes. The machine eventually reboots itself and comes back up cleanly.

The spacing of the outages seems to depend on how many people are using the system, but I've also seen the machine go down when the backup system runs a few backups on it. One thing I've noticed, though, is that the backup system doesn't actually cause a crash if the machine has been recently rebooted; in that case memory usage returns to normal after the backup finishes. Memory usage usually does NOT return to completely normal once the gigabytes of memory have been consumed: when that happens, the machine will sit there and keep running for a while with only 20MB or less free, until something presumably tries to use that memory and the machine flips out. That is the only time I've seen the backup system crash the machine, after it has endured significant usage during the day and there is 20MB or less free.

I'll usually get a call from the culprits telling me that they were copying either a) lots of files or b) large files to the cluster.

Any ideas here?  Anything I can look at to tune?

jonathan


--

Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster

