Running a two-machine cluster is a bad idea (though, because of budget limits, I am doing the same thing myself). If anything goes wrong on the link between the two machines, they fence each other.

In this particular case, I think you have some sort of network problem between the two machines. Try pinging each machine from the other and check the state of the connectivity when the problem arises. Perhaps an overly "intelligent" switch is handling the traffic and applying some sort of traffic shaping or control.

Leandro

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On behalf of Fabrizio Lippolis
Sent: Tuesday, 27 June 2006 10:40
To: linux-cluster@xxxxxxxxxx
Subject: "Missed too many heartbeats" messages and hung cluster

I have configured two machines in a cluster domain to run MySQL and LDAP services. Everything works correctly, except that from time to time, seemingly at random, the two machines hang. This is what I recently saw in the log of the second machine:

Jun 23 23:37:17 AICLSRV02 kernel: CMAN: removing node AICLSRV01 from the cluster : Missed too many heartbeats
Jun 23 23:37:17 AICLSRV02 fenced[2004]: AICLSRV01 not a cluster member after 0 sec post_fail_delay
Jun 23 23:37:17 AICLSRV02 fenced[2004]: fencing node "AICLSRV01"
Jun 23 23:37:17 AICLSRV02 fence_manual: Node AICLSRV01 needs to be reset before recovery can procede. Waiting for AICLSRV01 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n AICLSRV01)

A few seconds later the same messages appeared on the first machine:

Jun 23 23:37:36 AICLSRV01 kernel: CMAN: removing node AICLSRV02 from the cluster : Missed too many heartbeats
Jun 23 23:37:36 AICLSRV01 fenced[2084]: AICLSRV02 not a cluster member after 0 sec post_fail_delay
Jun 23 23:37:36 AICLSRV01 fenced[2084]: fencing node "AICLSRV02"
Jun 23 23:37:39 AICLSRV01 fence_manual: Node AICLSRV02 needs to be reset before recovery can procede.
Waiting for AICLSRV02 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n AICLSRV02)

Both machines had to be reset to get them working again. Could anybody please explain what caused this problem? I would also appreciate a suggestion on how to configure a fence device so that the services can keep running. As you can see, I currently have manual fencing configured, but that is not very useful.

Thank you in advance.

--
Fabrizio Lippolis
fabrizio.lippolis@xxxxxxxxxxxxxxxxxxxx
Auriga Informatica s.r.l. Via Don Guanella 15/B - 70124 Bari
Tel.: 080/5025414 - Fax: 080/5027448 - http://www.aurigainformatica.it/

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
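
The ping check suggested in the reply can be left running on each node, so that there is a timestamped record to compare with the CMAN messages the next time a node is removed. Below is a minimal sketch in Python, assuming a Linux ping that accepts -c (count) and -W (timeout in seconds). The function names, the one-second interval and the max_probes parameter are illustrative choices and not part of the cluster software.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host is answered.

    Assumes a Linux-style ping where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host, interval_s=1.0, probe=ping_once, max_probes=None):
    """Probe host repeatedly and report every change in reachability.

    Prints a syslog-like timestamped line on each change and returns
    the list of (timestamp, reachable) changes. With max_probes=None
    the loop runs until interrupted.
    """
    changes = []
    previous = None
    done = 0
    while max_probes is None or done < max_probes:
        reachable = probe(host)
        if reachable != previous:
            stamp = datetime.now().strftime("%b %d %H:%M:%S")
            state = "reachable" if reachable else "UNREACHABLE"
            print(f"{stamp} {host} {state}", flush=True)
            changes.append((stamp, reachable))
            previous = reachable
        done += 1
        time.sleep(interval_s)
    return changes
```

Calling monitor("AICLSRV01") on AICLSRV02, and the reverse on the other node, with the output redirected to a file, gives two logs that can be lined up against the "Missed too many heartbeats" entries. Note that ping exercises ICMP only, so a clean ping log does not by itself rule out a switch that treats the cluster's heartbeat traffic differently.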