I have configured two machines in a cluster domain to run MySQL and LDAP
services. Everything works correctly except that from time to time,
seemingly at random, both machines hang. This is what I recently saw in
the log of the second machine:
Jun 23 23:37:17 AICLSRV02 kernel: CMAN: removing node AICLSRV01 from the
cluster : Missed too many heartbeats
Jun 23 23:37:17 AICLSRV02 fenced[2004]: AICLSRV01 not a cluster member
after 0 sec post_fail_delay
Jun 23 23:37:17 AICLSRV02 fenced[2004]: fencing node "AICLSRV01"
Jun 23 23:37:17 AICLSRV02 fence_manual: Node AICLSRV01 needs to be reset
before recovery can procede. Waiting for AICLSRV01 to rejoin the
cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n AICLSRV01)
A few seconds later the same messages appeared on the first machine:
Jun 23 23:37:36 AICLSRV01 kernel: CMAN: removing node AICLSRV02 from the
cluster : Missed too many heartbeats
Jun 23 23:37:36 AICLSRV01 fenced[2084]: AICLSRV02 not a cluster member
after 0 sec post_fail_delay
Jun 23 23:37:36 AICLSRV01 fenced[2084]: fencing node "AICLSRV02"
Jun 23 23:37:39 AICLSRV01 fence_manual: Node AICLSRV02 needs to be reset
before recovery can procede. Waiting for AICLSRV02 to rejoin the
cluster or for manual acknowledgement that it has been reset (i.e.
fence_ack_manual -n AICLSRV02)
Both machines had to be reset to get them working again. Could anybody
please explain what happened to cause this problem? I would also
appreciate a suggestion on how to configure a fence device so that the
services can continue to run. As you can see, I currently have manual
fencing configured, but that is not very useful. Thank you in advance.
Fabrizio Lippolis fabrizio.lippolis@xxxxxxxxxxxxxxxxxxxx
Auriga Informatica s.r.l. Via Don Guanella 15/B - 70124 Bari
Tel.: 080/5025414 - Fax: 080/5027448 - http://www.aurigainformatica.it/