RE: [Linux-cluster] STONITH

"Kovacs, Corey J." <cjk@xxxxxxxxxx> · Fri, 6 Oct 2006 11:57:03 -0400

What exactly do you mean by outage? Power outage? If so, power for what?
Just network gear? As far as I know the MSA500 shouldn't "time out"; it's a
hard SCSI connection that's not in any way network dependent. I probably
missed something, but I'm not clear on your description of what happened.
If it's SCSI timeouts, then see below about profiles.

The MSA will not fail over correctly under Linux unless the "profile" for
the connections defined in the controllers is set up correctly. Even if
it's been done in the past, check it again. I've had the profile setting
reset to the defaults after updating firmware. Even then, there needs to be
I/O going down the pipe in order for the controllers to fail over correctly.

If everything went down, then I can almost guarantee that the nodes came
back online before the MSA was operational again. They're pretty slow to
boot, and I'd bet just about any computer will boot way before the MSA will
and thus not be able to see any of the devices it presents. A reboot of the
nodes then fixes that problem.
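
One way to avoid that second reboot is to have each node wait for the array
before the cluster services start. The following is only a rough sketch of
that idea, not something from a real setup: the function name, the timeout
values and the device path are all assumptions (/dev/cciss/c0d0 is the usual
name for the first logical drive under the cciss driver, but check your own).

```python
import os
import time


def wait_for_device(path, timeout=600.0, interval=5.0,
                    exists=os.path.exists, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until `path` exists or `timeout` seconds have passed.

    Returns True if the device node showed up, False on timeout.
    The exists/sleep/clock hooks are only there so the loop can be
    exercised without real hardware.
    """
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Something like wait_for_device("/dev/cciss/c0d0") could run from an init
script ahead of the cluster daemons, with a False result treated as "do not
start the cluster on this node yet". Note that the device node existing does
not by itself prove the LUN is readable, so treat it as a first check only.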

Aside from all of this, you probably need to figure out why the primary
controller failed in the first place. The fact that the redundancy failed
on you is not good. Sounds like it failed over but you likely have other
issues that are preventing the device paths from being maintained.

Finally, if all else is good, try forcibly failing the controllers over by
pulling the active one out and see how long it takes to recover. Then set
your heartbeat timeout slightly longer than that value.
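
As a rough illustration of "slightly longer than that value": if you repeat
the pull test a few times, take the slowest recovery and add some headroom.
The 20% margin, the one-second floor and the sample timings below are my own
made-up numbers, not anything measured on an MSA500, so substitute your own.

```python
import math


def heartbeat_timeout(recovery_seconds, margin=0.2, floor_seconds=1.0):
    """Return a timeout (whole seconds) a bit above the slowest failover.

    recovery_seconds: measured controller failover times, in seconds.
    margin:           fractional headroom added to the worst case (assumed).
    floor_seconds:    minimum headroom, so short times still get a buffer.
    """
    if not recovery_seconds:
        raise ValueError("need at least one measured recovery time")
    worst = max(recovery_seconds)
    headroom = max(worst * margin, floor_seconds)
    return math.ceil(worst + headroom)


# Hypothetical timings from three pull tests, in seconds.
measured = [18.5, 22.0, 20.3]
print(heartbeat_timeout(measured))
```

Basing it on the worst case rather than the average matters here, since a
single slow failover is enough to trip the heartbeat and get a node fenced.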

Hope the ramble helps.

Corey

From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Grant Waters
Sent: Friday, October 06, 2006 7:11 AM
To: linux-cluster@xxxxxxxxxx
Subject: STONITH

I had a quick search through your threads but couldn't find an exact hit
which includes a resolution so I thought I'd try posting this here.

We have a two node RH ES 3.0 cluster which uses an MSA 500 G2 shared array
with a single LUN, and a crossover cable set up as eth1 for heartbeat.
Both nodes are dual fed through an NPS power switch.

All works fine and has done 
for 18 months but we've had 2 outages recently where the following 
happens... 

We appear to lose eth1, and the MSA 500 G2 starts timing out, and by the
time I get in in the morning I can see errors on the MSA 500 G2 LCDs saying
"43 REDUNDANCY FAILED" and "POWER OK" respectively on the secondary and
primary controllers.

Both servers are up, but the failover node appears to 
have been forcibly rebooted by STONITH, with 2 plugs in the NPS being turned off 
& on again. 

This leaves neither 
node able to talk to the shared array, and the service down. 

Power cycling both nodes and the array fixes the problem, but I want to
know what's causing it in the first place. It doesn't appear to be related
to load, although I can't rule that out - both outages were at approx 04:40
on a Friday.

Here are the key msgs from syslog... 

Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout
.......~100 of these
Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout
Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Sep 29 04:45:03 node1 clumembd[2411]: <info> Membership View #3:0x00000001
Sep 29 04:45:04 node1 cluquorumd[2389]: <warning> --> Commencing STONITH <--
Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /Off.
Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /Off.
Sep 29 04:45:08 node1 cluquorumd[2389]: <notice> STONITH: node2-h has been fenced!
Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /On.
Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /On.
Sep 29 04:45:12 node1 cluquorumd[2389]: <notice> STONITH: node2-h is no longer fenced off.
Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on for RX.
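
To line the events above up against each other, a small script can pull the
timestamps out and report the gap between the first eth1 link-down and the
start of STONITH. This is only a sketch: the sample lines are copied from
the log above, syslog carries no year so 2006 is assumed, and the regular
expression and function names are my own.

```python
import re
from datetime import datetime

# "Mon DD HH:MM:SS host program[pid]: message" (the pid part is optional).
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) ([^:\[]+)(?:\[\d+\])?: (.*)$"
)

SAMPLE = """\
Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout
Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout
Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:45:03 node1 clumembd[2411]: <info> Membership View #3:0x00000001
Sep 29 04:45:04 node1 cluquorumd[2389]: <warning> --> Commencing STONITH <--
""".splitlines()


def parse(lines, year=2006):
    """Return (timestamp, program, message) tuples for lines that match."""
    events = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a syslog line
        stamp, _host, prog, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, prog, msg))
    return events


def seconds_from_link_down_to_stonith(events):
    """Seconds between the first eth1 link-down and the first STONITH."""
    down = next((t for t, _, m in events if "eth1: Link is down" in m), None)
    fence = next((t for t, _, m in events if "Commencing STONITH" in m), None)
    if down is None or fence is None:
        return None
    return (fence - down).total_seconds()


events = parse(SAMPLE)
print(seconds_from_link_down_to_stonith(events))  # prints 14.0
```

For the Sep 29 excerpt that gap is 14 seconds, and the cciss timeouts start
one second after the link drops.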

I thought 
it would go again this morning so I turned up the cluster daemon loglevels, and 
unfortunately it didn't crash but I spotted this in the debug msgs.... 

Oct  6 04:39:31 node1 clulockd[2462]: <debug> ioctl(fd,SIOCGARP,ar [eth1]): No such device or address
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Connect: Member #1 (192.168.100.101) [IPv4]
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Processing message on 11
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Received 188 bytes from peer
Oct  6 04:39:31 node1 clulockd[2462]: <debug> LOCK_LOCK | LOCK_TRYLOCK
Oct  6 04:39:31 node1 clulockd[2462]: <debug> lockd_trylock: member #1 lock 0
Oct  6 04:39:31 node1 clulockd[2462]: <debug> Replying ACK

The point is the cluster is working fine, and fails over and back fine. I
can telnet onto the NPS from both nodes so that's OK too.
As far as I can tell eth1 is set up OK, and working across 192.168
addresses.

Any ideas where to start looking at this? 

Regards,
GXW 
 :o)

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster