I have a third node that is unable to join my cluster (RHEL 6.3). It fails at "Joining fence domain", though I suspect that's a bit of a red herring. The logs aren't telling me much, even though I've increased verbosity. Can someone point me in the right direction on how to debug this?

The error:

    Joining fence domain...
    fence_tool: waiting for fenced to join the fence group.
    fence_tool: fenced not running, no lockfile

From fenced.log:

    Jul 11 13:17:54 fenced fenced 3.0.12.1 started
    Jul 11 13:17:55 fenced cpg_join fenced:daemon ...

And then the only errors/warnings I see in corosync.log:

    Jul 11 13:17:54 corosync [CMAN ] daemon: About to process command
    Jul 11 13:17:54 corosync [CMAN ] memb: command to process is 90
    Jul 11 13:17:54 corosync [CMAN ] memb: command return code is 0
    Jul 11 13:17:54 corosync [CMAN ] daemon: Returning command data. length = 440
    Jul 11 13:17:54 corosync [CMAN ] daemon: sending reply 40000090 to fd 18
    Jul 11 13:17:54 corosync [CMAN ] daemon: read 0 bytes from fd 18
    Jul 11 13:17:54 corosync [CMAN ] daemon: Freed 0 queued messages
    Jul 11 13:17:54 corosync [TOTEM ] Received ringid(10.128.32.22:28272) seq 61
    Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
    Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
    Jul 11 13:17:54 corosync [TOTEM ] FAILED TO RECEIVE
    Jul 11 13:17:54 corosync [TOTEM ] entering GATHER state from 6.
    Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
    Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
    Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
    Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
    Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
    Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
    Jul 11 13:17:54 corosync [CMAN ] daemon: read 20 bytes from fd 18
    <snip>
    Jul 11 13:17:59 corosync [CMAN ] daemon: About to process command
    Jul 11 13:17:59 corosync [CMAN ] memb: command to process is 90
    Jul 11 13:17:59 corosync [CMAN ] memb: cmd_get_node failed: id=0, name='<CC>^?'
    Jul 11 13:17:59 corosync [CMAN ] memb: command return code is -2
    Jul 11 13:17:59 corosync [CMAN ] daemon: Returning command data. length = 0
    Jul 11 13:17:59 corosync [CMAN ] daemon: sending reply 40000090 to fd 23
    Jul 11 13:17:59 corosync [CMAN ] daemon: read 0 bytes from fd 23
    Jul 11 13:17:59 corosync [CMAN ] daemon: Freed 0 queued messages
    Jul 11 13:17:59 corosync [CMAN ] daemon: read 20 bytes from fd 23
    Jul 11 13:17:59 corosync [CMAN ] daemon: client command is 5
    Jul 11 13:17:59 corosync [CMAN ] daemon: About to process command
    Jul 11 13:17:59 corosync [CMAN ] memb: command to process is 5
    Jul 11 13:17:59 corosync [CMAN ] daemon: Returning command data. length = 0
    Jul 11 13:17:59 corosync [CMAN ] daemon: sending reply 40000005 to fd 23
    <snip>

Back in fenced.log:

    Jul 11 13:18:05 fenced daemon cpg_join error retrying
    Jul 11 13:18:15 fenced daemon cpg_join error retrying
    Jul 11 13:18:21 fenced daemon cpg_join error 2
    Jul 11 13:18:23 fenced cpg_leave fenced:daemon ...
    Jul 11 13:18:23 fenced daemon cpg_leave error 9
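The TOTEM "FAILED TO RECEIVE" line is what makes me suspect the network rather than fenced itself; as I understand it, that message usually means corosync stopped receiving multicast traffic from the other nodes. For what it's worth, my next step was going to be a sanity check along these lines (just a sketch: 5405 is corosync's default mcastport, eth0 is a stand-in for the actual cluster interface, and the omping addresses are the two node IPs that appear in the logs plus a placeholder for the third):

    # What cman currently thinks the membership looks like
    cman_tool status
    cman_tool nodes

    # Is corosync multicast traffic actually arriving on this node?
    # (5405 = corosync default mcastport; eth0 is an assumption)
    tcpdump -n -i eth0 udp port 5405

    # Test multicast delivery between the nodes; run the same command
    # on every node at the same time (addresses from the logs above,
    # third node's address to be filled in)
    omping 10.128.32.22 10.130.32.32 <third-node-ip>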
And in /var/log/messages:

    Jul 11 13:17:50 server3 corosync[31116]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
    Jul 11 13:17:50 server3 corosync[31116]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
    Jul 11 13:17:50 server3 corosync[31116]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[1]: 3
    Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[1]: 3
    Jul 11 13:17:50 server3 ntpd[1747]: synchronized to 10.135.136.17, stratum 1
    Jul 11 13:17:50 server3 corosync[31116]: [CPG ] chosen downlist: sender r(0) ip(10.130.32.32) ; members(old:0 left:0)
    Jul 11 13:17:50 server3 corosync[31116]: [MAIN ] Completed service synchronization, ready to provide service.
    Jul 11 13:17:50 server3 corosync[31116]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Jul 11 13:17:50 server3 corosync[31116]: [CMAN ] quorum regained, resuming activity
    Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] This node is within the primary component and will provide service.
    Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[2]: 2 3
    Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[2]: 2 3
    Jul 11 13:17:54 server3 corosync[31116]: [TOTEM ] FAILED TO RECEIVE
    Jul 11 13:17:54 server3 fenced[31174]: fenced 3.0.12.1 started
    Jul 11 13:17:55 server3 dlm_controld[31192]: dlm_controld 3.0.12.1 started
    Jul 11 13:18:05 server3 dlm_controld[31192]: daemon cpg_join error retrying
    Jul 11 13:18:05 server3 fenced[31174]: daemon cpg_join error retrying
    Jul 11 13:18:05 server3 gfs_controld[31264]: gfs_controld 3.0.12.1 started
    Jul 11 13:18:15 server3 dlm_controld[31192]: daemon cpg_join error retrying
    Jul 11 13:18:15 server3 fenced[31174]: daemon cpg_join error retrying
    Jul 11 13:18:15 server3 gfs_controld[31264]: daemon cpg_join error retrying
    Jul 11 13:18:19 server3 abrtd: Directory 'ccpp-2012-07-11-13:18:18-31116' creation detected
    Jul 11 13:18:19 server3 abrt[31313]: Saved core dump of pid 31116 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-07-11-13:18:18-31116 (47955968
    Jul 11 13:18:21 server3 dlm_controld[31192]: daemon cpg_join error 2
    Jul 11 13:18:21 server3 gfs_controld[31264]: daemon cpg_join error 2
    Jul 11 13:18:21 server3 fenced[31174]: daemon cpg_join error 2
    Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 3
    Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 2
    Jul 11 13:18:23 server3 dlm_controld[31192]: daemon cpg_leave error 9
    Jul 11 13:18:23 server3 gfs_controld[31264]: daemon cpg_leave error 9
    Jul 11 13:18:23 server3 fenced[31174]: daemon cpg_leave error 9
    Jul 11 13:18:30 server3 abrtd: Sending an email...
    Jul 11 13:18:30 server3 abrtd: Email was sent to: root@localhost
    Jul 11 13:18:30 server3 abrtd: Duplicate: UUID
    Jul 11 13:18:30 server3 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
    Jul 11 13:18:30 server3 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
    Jul 11 13:18:30 server3 abrtd: Deleting problem directory ccpp-2012-07-11-13:18:18-31116 (dup of ccpp-2012-07-06-10:30:40-22107)

Any tips much appreciated.
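P.S. I also noticed that abrtd is catching a corosync core dump on each attempt, and today's is flagged as a duplicate of one from Jul 6, so corosync has apparently been crashing the same way for a while. I was planning to pull a backtrace out of the saved problem directory with something like the following (this assumes the matching corosync debuginfo packages are installable and that abrt stored the core as a file named "coredump" inside the directory it logged):

    # Pull in symbols so the backtrace is readable
    # (debuginfo-install is from yum-utils; needs debuginfo repos)
    debuginfo-install corosync corosynclib

    # Open the core abrt kept (path from the abrtd DUP_OF_DIR line above)
    gdb /usr/sbin/corosync /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107/coredump

    # then at the (gdb) prompt:
    #   bt                    - backtrace of the crashing thread
    #   thread apply all bt   - backtraces for every thread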