Fw: STONITH

Grant Waters <gwaters1@xxxxxxx> · Fri, 6 Oct 2006 13:03:00 +0100

I forgot to say that I also get the following
messages in syslog when I telnet to the NPS:

Oct  6 12:53:34 node1 cluquorumd[27339]: Cannot log into WTI Network/Telnet Power Switch.

Oct  6 12:53:34 node1 cluquorumd[27339]: <err> STONITH: Device at xx.xxx.xxx.xxx controlling node2-h FAILED status check: Bad configuration

Oct  6 12:53:47 node1 cluquorumd[2384]: <crit> Error returned from STONITH device(s) controlling node1-h. See system logs on node2-h for more information.

I obscured the IP address in there -
but it is the correct address of the NPS.

What could this "Bad configuration"
be? Is it the /etc/cluster.xml?
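
One cheap thing to rule out is a config file that is not even valid XML. The sketch below is only that: it checks that a piece of XML text parses, using Python's standard library. The helper name `check_well_formed` is made up for illustration, and it knows nothing about which cluster or STONITH settings are correct, so a pass here does not mean the "Bad configuration" message is explained.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, root_tag) if xml_text parses as XML, else (False, error).

    Hypothetical helper: it only tests well-formedness, not whether the
    cluster or power-switch settings inside the file make sense.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, root.tag
```

It could be run against the file contents with something like `check_well_formed(open("/etc/cluster.xml").read())`, on each node in turn.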

Regards,

GXW  :o)

----- Forwarded by
Grant Waters/GIS/CSC on 06/10/2006 13:00 -----

Grant Waters/GIS/CSC

06/10/2006 12:11

To
linux-cluster@xxxxxxxxxx

cc

Subject
STONITH 

I had a quick search through your threads
but couldn't find an exact hit which includes a resolution so I thought
I'd try posting this here.

We have a two node RH ES 3.0 cluster
which uses an MSA 500 G2 shared array with a single LUN, and a crossover
cable set up as eth1 for heartbeat.

Both nodes are dual fed through an NPS
power switch.

All works fine and has done for 18 months
but we've had 2 outages recently where the following happens...

We appear to lose eth1, and the MSA
500 G2 starts timing out, and by the time I get in in the morning I can
see errors on the MSA 500 G2 LCDs saying "43 REDUNDANCY FAILED"
and "POWER OK" respectively on the secondary and primary controllers.

Both servers are up, but the failover
node appears to have been forcibly rebooted by STONITH, with 2 plugs in
the NPS being turned off & on again.

This leaves neither node able to talk
to the shared array, and the service down.

Power cycling both nodes and the
array fixes the problem, but I want to know what's causing it in the first
place.  It doesn't appear to be related to load, although I can't
rule that out - both outages were at approx 04:40 on a Friday.

Here are the key msgs from syslog...

Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down.

Sep 29 04:44:51 node1 kernel: cciss: cmd f79252b0 timedout

.......~100 of these

Sep 29 04:44:51 node1 kernel: cciss: cmd f79216f8 timedout

Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.

Sep 29 04:44:53 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX.

Sep 29 04:45:03 node1 clumembd[2411]: <info> Membership View #3:0x00000001

Sep 29 04:45:04 node1 cluquorumd[2389]: <warning> --> Commencing STONITH <--

Sep 29 04:45:06 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /Off.

Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down.

Sep 29 04:45:08 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /Off.

Sep 29 04:45:08 node1 cluquorumd[2389]: <notice> STONITH: node2-h has been fenced!

Sep 29 04:45:10 node1 cluquorumd[2389]: Power to NPS outlet(s) 6 turned /On.

Sep 29 04:45:12 node1 cluquorumd[2389]: Power to NPS outlet(s) 2 turned /On.

Sep 29 04:45:12 node1 cluquorumd[2389]: <notice> STONITH: node2-h is no longer fenced off.

Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.

Sep 29 04:45:14 node1 kernel: tg3: eth1: Flow control is off for TX and off for RX.

Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down.

Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.

Sep 29 04:47:44 node1 kernel: tg3: eth1: Flow control is on for TX and on for RX.
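
To see how the eth1 link flaps line up with the fencing, the timestamps can be pulled out of the messages above and each "Link is down" paired with the next "Link is up". The following is a rough Python sketch, not anything the cluster software provides: `parse_events` and `link_outages` are made-up helper names, it assumes one message per line, and it assumes the year 2006 because syslog does not record one.

```python
import re
from datetime import datetime

# A few lines copied from the syslog excerpt above, one message per line.
SAMPLE = """\
Sep 29 04:44:50 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:44:53 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:45:04 node1 cluquorumd[2389]: <warning> --> Commencing STONITH <--
Sep 29 04:45:07 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:45:14 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 29 04:47:41 node1 kernel: tg3: eth1: Link is down.
Sep 29 04:47:44 node1 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
"""

# timestamp, host, program name (optional [pid]), then the message text
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+([^:\[]+)(?:\[\d+\])?:\s*(.*)$"
)


def parse_events(text, year=2006):
    """Return a list of (datetime, program, message) for each matching line."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a syslog line
        stamp, _host, program, message = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, program.strip(), message))
    return events


def link_outages(events, iface="eth1"):
    """Pair each 'Link is down' with the next 'Link is up' for one interface.

    Returns a list of (down_time, up_time, seconds_down).
    """
    outages = []
    down_at = None
    for when, program, message in events:
        if program != "kernel" or f"{iface}: Link is" not in message:
            continue
        if "Link is down" in message:
            down_at = when
        elif "Link is up" in message and down_at is not None:
            outages.append((down_at, when, (when - down_at).total_seconds()))
            down_at = None
    return outages


if __name__ == "__main__":
    for down, up, seconds in link_outages(parse_events(SAMPLE)):
        print(f"eth1 down {down:%H:%M:%S} -> up {up:%H:%M:%S} ({seconds:.0f}s)")
```

On the sample lines this gives three short outages. The middle one starts after "Commencing STONITH", which would fit the far end of the crossover cable losing power when the other node is fenced, whereas the first one precedes the fencing.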

I thought it would go again this morning,
so I turned up the cluster daemon log levels. Unfortunately it didn't
crash, but I spotted this in the debug msgs:

Oct  6 04:39:31 node1 clulockd[2462]: <debug> ioctl(fd,SIOCGARP,ar [eth1]): No such device or address

Oct  6 04:39:31 node1 clulockd[2462]: <debug> Connect: Member #1 (192.168.100.101) [IPv4]

Oct  6 04:39:31 node1 clulockd[2462]: <debug> Processing message on 11

Oct  6 04:39:31 node1 clulockd[2462]: <debug> Received 188 bytes from peer

Oct  6 04:39:31 node1 clulockd[2462]: <debug> LOCK_LOCK | LOCK_TRYLOCK

Oct  6 04:39:31 node1 clulockd[2462]: <debug> lockd_trylock: member #1 lock 0

Oct  6 04:39:31 node1 clulockd[2462]: <debug> Replying ACK

The point is the cluster is working
fine, and fails over and back fine.  I can telnet onto the NPS from
both nodes so that's OK too.

As far as I can tell eth1 is set up
OK, and working across 192.168 addresses.

Any ideas where to start looking at
this?

Regards,

GXW  :o)
