Re: fencing for no reason that I can see

Heiko Nardmann <heiko.nardmann@xxxxxxxxxxxxx> · Tue, 11 Sep 2012 17:17:37 +0200

Hi,

I had similar problems. In our case the cause turned out to be outdated
or buggy firmware on the Broadcom NICs in our Dell R610 servers. So,
depending on your hardware, please have the vendor check your firmware,
BIOS, and related versions. That might help.

Kind regards,

    Heiko

On 11.09.2012 03:27, Terry wrote:
Hello,

I have seen this a few times: one node stops seeing the other node for
some unknown reason and fences it.  Any idea how I can debug this?
Here is the log from the node doing the fencing:

Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] A processor failed, forming new configuration.
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [QUORUM] Members[1]: 1
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change: omadvnfs01b.sec.jel.lc DOWN
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [CPG   ] chosen downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
Sep 10 19:01:25 omadvnfs01a corosync[10371]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc

And here is the log from the fenced node:

Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source = /proc/kmsg started.
Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start

I did notice that the two nodes' clocks were about 40 seconds apart.  I
have just fixed that, but what else can I look for here?  Our monitoring
started noticing at 19:02:30 that the fenced node was off the grid,
which is a little after it was fenced.  What test is performed to see
if the other node is up?  How many times does it try?
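
Since the two clocks disagreed, it can help to put the timestamps from
both excerpts above on one timeline before reading anything into the
gaps. What follows is only a rough sketch: the year 2012 is assumed from
the mail date, the helper names and the skew_seconds parameter are made
up for illustration, and the sign of the 40 second offset is unknown, so
both directions are shown.

```python
from datetime import datetime

# Assumption: syslog stamps carry no year; 2012 is taken from the mail date.
ASSUMED_YEAR = 2012


def parse_syslog_time(line, year=ASSUMED_YEAR):
    """Return the leading 'Mon DD HH:MM:SS' stamp of a syslog line as a datetime."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_between(earlier_line, later_line, skew_seconds=0):
    """Seconds from earlier_line to later_line.

    skew_seconds (hypothetical parameter) is added to the result to
    compensate for a clock offset between hosts. Its sign depends on
    which clock was ahead, which the thread does not say.
    """
    delta = parse_syslog_time(later_line) - parse_syslog_time(earlier_line)
    return delta.total_seconds() + skew_seconds


# Lines copied from the excerpts above (wrapped lines joined).
totem_failed = ("Sep 10 19:01:23 omadvnfs01a corosync[10371]:   [TOTEM ] "
                "A processor failed, forming new configuration.")
fence_start = ("Sep 10 19:01:25 omadvnfs01a fenced[10427]: "
               "fencing node omadvnfs01b.sec.jel.lc")
peer_booted = ("Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, "
               "log source = /proc/kmsg started.")

# Same host, so no skew correction is needed.
print(seconds_between(totem_failed, fence_start))        # 2.0

# Different hosts: show the raw gap and both possible skew directions.
print(seconds_between(fence_start, peer_booted))         # 199.0
print(seconds_between(fence_start, peer_booted, -40))    # 159.0
print(seconds_between(fence_start, peer_booted, 40))     # 239.0
```

The first figure compares two lines from the same host, so it does not
depend on the offset. The cross-host figures are only good to within
the 40 seconds mentioned above.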

Thanks!

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster