On Mon, Sep 10, 2012 at 8:27 PM, Terry <td3201@xxxxxxxxx> wrote:
> Hello,
>
> I have seen this a few times where one node stops seeing the other
> node for some unknown reason and fences it. Any idea how I can debug
> this? Here's from the node doing the fencing:
>
> Sep 10 19:01:23 omadvnfs01a corosync[10371]: [TOTEM ] A processor failed, forming new configuration.
> Sep 10 19:01:25 omadvnfs01a corosync[10371]: [QUORUM] Members[1]: 1
> Sep 10 19:01:25 omadvnfs01a corosync[10371]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Sep 10 19:01:25 omadvnfs01a rgmanager[10692]: State change: omadvnfs01b.sec.jel.lc DOWN
> Sep 10 19:01:25 omadvnfs01a corosync[10371]: [CPG ] chosen downlist: sender r(0) ip(10.198.1.110) ; members(old:2 left:1)
> Sep 10 19:01:25 omadvnfs01a corosync[10371]: [MAIN ] Completed service synchronization, ready to provide service.
> Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc
>
> And here is from the fenced node:
>
> Sep 10 17:09:27 omadvnfs01b rpc.idmapd[6126]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
> Sep 10 17:14:47 omadvnfs01b rpc.idmapd[6125]: nfsdcb: read(/proc/net/rpc/nfs4.idtoname/channel) failed: errno 0 (End of File)
> Sep 10 19:04:44 omadvnfs01b kernel: imklog 5.8.10, log source = /proc/kmsg started.
> Sep 10 19:04:44 omadvnfs01b rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2379" x-info="http://www.rsyslog.com"] start
>
> I did notice that they were about 40 seconds off in time. I just
> fixed that but what else can I look for here. Our monitoring started
> noticing things at 19:02:30 that the fenced node was off the grid
> which is a little after it was fenced. What test is performed to see
> if the other node is up? How many times does it try?
>
> Thanks!

I guess I should have read the docs more thoroughly.
Right from the RHEL 6 cluster guide:

    Ensure that exotic bond modes and VLAN tagging are not in use on
    interfaces that the cluster uses for inter-node communication.

I am using a three-interface 802.3ad link aggregate on the production
network. I could either use an iSCSI interface or split one of the
three bond slave interfaces out and dedicate it to inter-node traffic.

I was also looking into a potential multicast issue, but I believe my
switches (Foundry FLS) support it fine. I wouldn't think a multicast
problem would be intermittent like this. Anyone have any other thoughts?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
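
To put numbers on the timeline in the logs quoted above, here is a minimal Python sketch (not part of the original thread). It parses the syslog timestamps from the fencing node and measures how long passed between corosync reporting the failed processor and fenced acting, then how far the 19:02:30 monitoring alert trails the fence. The helper names `parse_ts` and `first_match` are hypothetical, the year 2012 is an assumption taken from the message date because syslog omits it, and the roughly 40 second clock offset is not corrected for, since the thread does not say which clock was ahead.

```python
from datetime import datetime

# Lines copied from the fencing node's log excerpt quoted in this thread.
LOG_LINES = [
    "Sep 10 19:01:23 omadvnfs01a corosync[10371]: [TOTEM ] A processor failed, forming new configuration.",
    "Sep 10 19:01:25 omadvnfs01a corosync[10371]: [QUORUM] Members[1]: 1",
    "Sep 10 19:01:25 omadvnfs01a fenced[10427]: fencing node omadvnfs01b.sec.jel.lc",
]

# Assumption: syslog timestamps carry no year, so it is taken from the email date.
YEAR = 2012


def parse_ts(line, year=YEAR):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing `needle`."""
    for line in lines:
        if needle in line:
            return parse_ts(line)
    raise ValueError(f"no log line contains {needle!r}")


failed_at = first_match(LOG_LINES, "A processor failed")
fenced_at = first_match(LOG_LINES, "fencing node")

# Time the monitoring system first noticed the outage, as stated in the post.
monitoring_at = datetime(YEAR, 9, 10, 19, 2, 30)

# Seconds from corosync declaring the failure to fenced starting the fence.
detect_to_fence = (fenced_at - failed_at).total_seconds()

# Seconds from the fence to the monitoring alert (clock offset not applied).
fence_to_alert = (monitoring_at - fenced_at).total_seconds()

print(f"failure declared -> fence: {detect_to_fence:.0f} s")
print(f"fence -> monitoring alert: {fence_to_alert:.0f} s")
```

On these timestamps the fence follows the failure report by 2 seconds and the monitoring alert comes 65 seconds later, which fits the observation that monitoring noticed the node only after it had been fenced. Whether the 40 second offset widens or narrows that gap depends on which clock was wrong.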