On Mar 11, 2008, at 10:13 AM, Bob Peterson wrote:
On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
I have just had my cluster crash yet again, but this time, I was able to
capture the full kernel panic.
<snip>
I'm experiencing upwards of 8 crashes a day because of this. What can I do
about it?
Thanks,
James
Hi James,
The only times I've seen a problem like this were when GFS's resource
group information somehow got corrupted. I recommend doing this:
1. Unmount the file system from all nodes in your cluster
Is there an easy way to determine which filesystem(s) it is? I have 13.
2. Back up your storage in any way you can without it being mounted
(dd it to another storage or tape or something?)
3. Run gfs_fsck on the file system. If this is > 15TB, make sure
you run it on a 64-bit node.
All nodes in this cluster are 64-bit. Are there any guidelines on how
much memory I should have in each node? Right now, they each have 2 GB.
Hopefully your system isn't too old and you have a relatively recent
version of gfs_fsck, which has the smarts to repair damaged RGs.
gfs-utils-0.1.12-1.el5
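
As a rough sketch of steps 1 to 3 above, the Python helper below only assembles the command lines and prints them for review; it does not execute anything. The device, mount point, and backup paths and the function name are hypothetical placeholders. The `-n` (answer no to all prompts, so nothing is changed) and `-y` (answer yes to all prompts) options should be checked against the gfs_fsck man page for the installed gfs-utils version before use.

```python
import shlex


def build_recovery_plan(device, mount_point, backup_path):
    """Return the commands for the unmount / backup / check sequence.

    Nothing is executed here; the commands are only assembled for review.
    All three arguments are placeholders supplied by the caller.
    """
    return [
        # Step 1: unmount (must be repeated on every node in the cluster).
        ["umount", mount_point],
        # Step 2: raw copy of the unmounted device to other storage.
        ["dd", f"if={device}", f"of={backup_path}", "bs=1M"],
        # Step 3a: check-only pass; -n answers "no" to every repair prompt.
        ["gfs_fsck", "-n", device],
        # Step 3b: repair pass; -y answers "yes" to every repair prompt.
        ["gfs_fsck", "-y", device],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    plan = build_recovery_plan("/dev/vg0/gfs01", "/mnt/gfs01", "/backup/gfs01.img")
    for cmd in plan:
        print(shlex.join(cmd))
```

Running the check-only pass first gives a report of what gfs_fsck would change without touching the file system, though the file system still has to be unmounted on all nodes for it.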
I'm just guessing about the corruption, but given that, the
next question is how it got corrupted. There are a number of ways
that can happen. For example, hardware problems, or running gfs_fsck
while the file system is mounted on some node. BTW, I've only seen
RG corruption two or three times in the past 2+ years.
Is there a way I can find out for sure whether it's resource group
corruption before I run gfs_fsck?
I have only had this cluster set up since December, and I started
having problems with it not long after that. At first, I was seeing a
crash a day, and then I was having maybe one crash a week; however, I
had a total of 47 reboots within the cluster yesterday. I have also
been somewhat concerned about the high load average on each node where
a service is running. For example, one node is serving 5 of those 13
filesystems. Its load average is currently (and commonly) hovering
between 35 and 55. On nodes that aren't running any services, the
load average is 0.
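
To put load averages like these in context, they are often compared with the number of CPUs on the node. The sketch below is a minimal illustration of that comparison; the helper name is hypothetical, and the CPU count of the nodes in this cluster is not stated in the thread. On Linux the load average also counts tasks blocked in uninterruptible sleep (typically waiting on I/O), so a high figure on a file-serving node may reflect I/O waits rather than CPU saturation.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Return the load average divided by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-minute load per CPU: {load_per_cpu(one_min, cpus):.2f}")
```

A value well above 1.0 per CPU means more tasks are runnable or blocked than the node has processors to serve them.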
Thanks,
James
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster