Re: corosync ring failure

"C. Handel" <christoph@xxxxxxxxxxxxxx> · Thu, 24 Jul 2014 09:30:01 +0200

>>> I run a cluster with two corosync rings. One of the rings is marked
>>> faulty every forty seconds, only to recover a second later.
>>> The other ring is stable.
>>>
>>> I have no idea how I should debug this.
>>>
>>>
>>> We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12, corosync 1.4.1.
>>> The cluster consists of three machines. Ring1 is running on 10 Gigabit
>>> interfaces, Ring0 on 1 Gigabit interfaces. Neither ring leaves its
>>> respective switch.

>> Any logs in the switch? Is the multicast group being deleted/recreated?

> I believe there would be no multicast with the UDPU transport.

> Can you check to see if any of the devices (servers and switches) is dropping
> UDP packets, be it for congestion or damage?

The switch has no load: interface utilization is below 10%, there are no
CRC errors on the ports, and no errors in the log. On the same switch a
second cluster (four machines, similar config) is running fine.
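
The switch counters cover only half of the question; the servers can drop
UDP datagrams too. A minimal sketch for checking that side, assuming Linux
hosts that expose the kernel UDP counters in /proc/net/snmp. The function
names are made up for illustration and are not part of corosync or any
other tool. The idea is to read the counters, wait through one
forty-second fault cycle, read them again, and compare.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from the text of /proc/net/snmp as a dict.

    The file holds two "Udp:" lines: the first lists the column names,
    the second the matching values.
    """
    header = None
    for line in snmp_text.splitlines():
        if not line.startswith("Udp:"):
            continue  # skips other protocols, including "UdpLite:"
        fields = line.split()[1:]
        if header is None:
            header = fields  # first Udp: line carries the names
        else:
            return dict(zip(header, (int(v) for v in fields)))
    return {}


def counter_deltas(before, after):
    """Return how much each counter present in both samples has grown."""
    return {name: after[name] - before[name]
            for name in after if name in before}


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the counters on the local host (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())
```

Calling read_udp_counters() on each node before and after a fault cycle
and passing both results to counter_deltas() shows whether InErrors or
RcvbufErrors grew in that window. The counters are host-wide, so growth
only hints that some UDP socket on the node lost datagrams; it does not
prove that corosync was the one affected.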

Greetings
   Christoph

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster