I have set the transport to udpu. The physical nodes are intended to replace the virtual nodes; the plan was to decommission the virtual nodes once the cluster was stable on the physical nodes. I will also remove the virtual nodes from the cluster now and see if that makes any difference. When I was running only the two virtual nodes I did not have any of these issues.
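For reference, a rough sketch of where that attribute sits in a RHEL 6 /etc/cluster/cluster.conf; the cluster name, config_version, and node entries below are placeholders, and the fencing/resource sections are omitted:

<?xml version="1.0"?>
<cluster name="example-cluster" config_version="2">
  <!-- transport="udpu" switches cman/corosync from multicast to UDP unicast -->
  <cman transport="udpu"/>
  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1"/>
    <clusternode name="node2.example.com" nodeid="2"/>
    <clusternode name="node3.example.com" nodeid="3"/>
    <clusternode name="node4.example.com" nodeid="4"/>
  </clusternodes>
</cluster>

As Chrissie notes below, the change only takes effect after the whole cluster is stopped and restarted with the updated file in place on every node; remember to bump config_version whenever cluster.conf is edited.
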
On 6/19/14, 6:02 AM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

>On 17/06/14 15:27, Schaefer, Micah wrote:
>> I am running Red Hat 6.4 with the HA/load balancing packages from the
>> install DVD.
>>
>> -bash-4.1$ cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 6.4 (Santiago)
>>
>> -bash-4.1$ corosync -v
>> Corosync Cluster Engine, version '1.4.1'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>
>Thanks. 6.5 has better pause detection in it, but I don't think that's
>the issue here. It looks to me like some messages are getting through
>but not others, so I'm back to seriously wondering whether multicast
>traffic is being forwarded correctly and reliably. Having a mix of
>virtual and physical systems can cause these sorts of issues when real
>and software switches are mixed, though I haven't seen anything quite
>as odd as this, to be honest.
>
>Can you try either UDPU (preferred) or broadcast transport please and
>see if that helps or changes the symptoms at all? Broadcast could be
>problematic itself with the real/virtual mix, so UDPU will be the more
>reliable option.
>
>Annoyingly, you'll need to take down the whole cluster to do this, and add
>
><cman transport="udpu"/>
>
>to /etc/cluster/cluster.conf on all nodes.
>
>Chrissie
>
>> On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
>>
>>> On 12/06/14 20:06, Digimer wrote:
>>>> Hrm, I'm not really sure that I am able to interpret this without
>>>> making guesses. I'm cc'ing one of the devs (who I hope will poke the
>>>> right person if he's not able to help at the moment). Let's see what
>>>> he has to say.
>>>>
>>>> I am curious now, too. :)
>>>>
>>>> On 12/06/14 03:02 PM, Schaefer, Micah wrote:
>>>>> Node4 was fenced again. I was able to get some debug logs (below) and
>>>>> a new message:
>>>>>
>>>>> "Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the
>>>>> OPERATIONAL state."
>>>>>
>>>>> Rest of the corosync logs:
>>>>>
>>>>> http://pastebin.com/iYFbkbhb
>>>>>
>>>>> Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
>>>>> Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages.
>>>
>>> I'm concerned that the pause messages are repeating like that; it looks
>>> like it might be a fixed bug. What version of corosync do you have?
>>>
>>> Chrissie
>>
>

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster