Re: corosync ring failure

Michael Schwartzkopff <misch@xxxxxxxxxxxxxxxxx> · Thu, 24 Jul 2014 09:42:08 +0200

On Thursday, 24 July 2014, 09:30:01, C. Handel wrote:
> >>> I run a cluster with two corosync rings. One of the rings is marked
> >>> faulty every forty seconds, only to recover a second later.
> >>> The other ring is stable.
> >>> 
> >>> I have no idea how to debug this.
> >>> 
> >>> 
> >>> We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12, corosync 1.4.1.
> >>> The cluster consists of three machines. Ring1 is running on 10 Gigabit
> >>> interfaces, Ring0 on 1 Gigabit interfaces. Neither ring leaves its
> >>> respective switch.
> >> 
> >> Any logs in the switch? Is the multicast group being deleted/recreated?
> > 
> > > I believe there would be no multicast for UDPU transport.
> >
> > Can you check to see if any of the devices (servers and switches) is
> > dropping UDP packets, be it from congestion or damage?
> 
> The switch has no load, interface utilization is below 10%, there are
> no CRC errors on the ports and no errors in the log. On the same switch
> a second cluster (four machines, similar config) is running fine.

Any spanning tree problems? Do you have any bridges (e.g. for virtual
machines) configured in your setup?
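
To check the nodes for bridges, a small Python sketch like the one
below might help. It assumes the usual Linux sysfs layout, where a bridge
device has a "bridge" directory under /sys/class/net/<name>/ and its
member ports appear under "brif". The function name list_bridges and the
sysfs_net parameter are only illustrative. A stp_state of 0 should mean
spanning tree is switched off on that bridge.

```python
import os


def list_bridges(sysfs_net="/sys/class/net"):
    """Return {bridge_name: {"stp_state": str or None, "ports": [names]}}.

    Assumption: standard Linux sysfs layout. An interface is treated as a
    bridge if it has a "bridge" subdirectory.
    """
    bridges = {}
    if not os.path.isdir(sysfs_net):
        return bridges
    for name in sorted(os.listdir(sysfs_net)):
        bridge_dir = os.path.join(sysfs_net, name, "bridge")
        if not os.path.isdir(bridge_dir):
            continue  # plain interface or a regular file, not a bridge
        try:
            with open(os.path.join(bridge_dir, "stp_state")) as fh:
                stp = fh.read().strip()
        except OSError:
            stp = None  # attribute missing or unreadable
        brif = os.path.join(sysfs_net, name, "brif")
        ports = sorted(os.listdir(brif)) if os.path.isdir(brif) else []
        bridges[name] = {"stp_state": stp, "ports": ports}
    return bridges


if __name__ == "__main__":
    for name, info in list_bridges().items():
        print(name, "stp_state=" + str(info["stp_state"]),
              "ports=" + ",".join(info["ports"]))
```

If one of the ring interfaces shows up as a port of a bridge with STP
enabled, that would be the first place I would look.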

Did you do any debugging on your switch?

Greetings,

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0162) 1650044
Fax: (089) 620 304 13
-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster