Re: Kernel panic

James Chamberlain <jamesc@xxxxxxx> · Tue, 11 Mar 2008 12:45:10 -0400

On Mar 11, 2008, at 10:13 AM, Bob Peterson wrote:

On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
I have just had my cluster crash yet again, but this time, I was able to
capture the full kernel panic.
<snip>
I'm experiencing upwards of 8 crashes a day because of this.  What can I do
about it?

Thanks,

James

Hi James,

The only times I've seen a problem like this were when GFS's resource
group information somehow got corrupted.  I recommend doing this:

1. Unmount the file system from all nodes in your cluster

Is there an easy way to determine which filesystem(s) it is?  I have 13.

2. Back up your storage in any way you can without it being mounted
  (dd it to another storage or tape or something?)
3. Run gfs_fsck on the file system.  If this is > 15TB, make sure
  you run it on a 64-bit node.

All nodes in this cluster are 64-bit.  Are there any guidelines on how
much memory I should have in each node?  Right now, they each have 2 GB.
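
The three steps quoted above (unmount everywhere, take a raw backup, then run gfs_fsck) can be sketched as a small Python helper that only builds the command lines, so they can be reviewed before anything is run. The device path, mount point, and backup target are hypothetical placeholders, and the `bs=1M` block size and the `-v` (verbose) flag are illustrative choices, not recommendations from this thread. The unmount has to be repeated on every node in the cluster; this sketch covers a single node.

```python
import shlex


def build_recovery_plan(device, mount_point, backup_target):
    """Return the commands for one node, in the order given in the thread.

    All three arguments are hypothetical placeholders supplied by the caller.
    Nothing is executed here; the caller decides when and where to run them.
    """
    return [
        # Step 1: unmount the file system (repeat on every cluster node).
        ["umount", mount_point],
        # Step 2: raw copy of the unmounted device to other storage.
        ["dd", f"if={device}", f"of={backup_target}", "bs=1M"],
        # Step 3: check and repair; -v asks gfs_fsck for verbose output.
        ["gfs_fsck", "-v", device],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    plan = build_recovery_plan("/dev/vg_example/lv_gfs01", "/mnt/gfs01",
                               "/backup/gfs01.img")
    for argv in plan:
        print(shlex.join(argv))
```

Keeping the plan as data rather than running it directly makes it easy to print, review, and repeat for each of the 13 filesystems.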

Hopefully your system isn't too old and you have a relatively recent
version of gfs_fsck, which has the smarts to repair damaged RGs.

gfs-utils-0.1.12-1.el5

I'm just guessing about the corruption, but given that, the
next question is how it got corrupted.  There are a number of ways
that can happen.  For example, hardware problems, or running gfs_fsck
while the file system is mounted on some node.  BTW, I've only seen
RG corruption two or three times in the past 2+ years.

Is there a way I can find out for sure whether it's resource group
corruption before I run gfs_fsck?
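
One low-risk way to look before repairing, offered as a sketch and not as a confirmed diagnostic for resource group damage: gfs_fsck accepts `-n`, which answers "no" to every repair prompt so the file system is not modified. The helper below lists the GFS devices from `/proc/mounts`-style text (assuming the type field reads `gfs`) and builds one read-only command per device. The sample mount table is invented for illustration, and each file system still has to be unmounted on all nodes before its command is run.

```python
def gfs_devices(mounts_text, fstype="gfs"):
    """Return (device, mount_point) pairs whose type field equals fstype.

    mounts_text uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == fstype:
            pairs.append((fields[0], fields[1]))
    return pairs


def read_only_checks(mounts_text):
    """Build one `gfs_fsck -n <device>` command per GFS device found."""
    return [["gfs_fsck", "-n", device]
            for device, _mount_point in gfs_devices(mounts_text)]


# Invented sample; real input would come from open("/proc/mounts").read().
SAMPLE = """\
/dev/sda1 / ext3 rw 0 0
/dev/vg_example/lv_gfs01 /mnt/gfs01 gfs rw 0 0
/dev/vg_example/lv_gfs02 /mnt/gfs02 gfs rw 0 0
"""

if __name__ == "__main__":
    for argv in read_only_checks(SAMPLE):
        print(" ".join(argv))
```

The device list has to be captured while the filesystems are still mounted, and the commands run only after they have been unmounted everywhere.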

I have only had this cluster set up since December, and I started
having problems with it not long after that.  At first, I was seeing a
crash a day, and then I was having maybe one crash a week; however, I
had a total of 47 reboots within the cluster yesterday.  I have also
been somewhat concerned about the high load average on each node where
a service is running.  For example, one node is serving 5 of those 13
filesystems.  Its load average commonly hovers between 35 and 55, as it
does right now.  On nodes that aren't running any services, the
load average is 0.
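
To put a figure such as 35 to 55 in context, it can help to compare the load average with the number of CPUs on the node. On Linux the load average counts tasks in uninterruptible sleep (typically waiting on I/O) as well as runnable ones, so a high value does not by itself mean the CPUs are saturated. A minimal sketch, assuming the standard `/proc/loadavg` layout; the sample line and the CPU count of 4 are invented, not taken from this cluster.

```python
def parse_loadavg(text):
    """Parse a /proc/loadavg line into its 1-, 5-, and 15-minute averages."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_per_cpu(load, cpu_count):
    """Normalize a load average by CPU count.

    Values well above 1.0 mean more tasks are runnable or blocked
    than there are CPUs to serve them.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


if __name__ == "__main__":
    # Invented sample line in /proc/loadavg format.
    one, five, fifteen = parse_loadavg("44.00 41.50 38.25 3/512 12345")
    print(load_per_cpu(one, 4))
```

On a live node the input would come from `open("/proc/loadavg").read()` and the CPU count from `os.cpu_count()`.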

Thanks,

James

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster