Solved. The issue was that this was a two-node cluster: adding a third node
forces the cluster to reconfigure itself from a 2-node to a 3-node cluster,
which requires a restart of the whole cluster. I would have expected a clear
error message about this, but it seems to just fail silently instead. A sketch
of the cluster.conf change is below, after the quoted logs.

On Wed, Jul 11, 2012, at 14:26, urgrue wrote:
> I have a third node unable to join my cluster (RHEL 6.3). It fails at
> 'joining fence domain', though I suspect that's a bit of a red herring.
> The log isn't telling me much, even though I've increased verbosity.
> Can someone point me in the right direction as to how to debug?
>
> The error:
> Joining fence domain... fence_tool: waiting for fenced to join the fence group.
> fence_tool: fenced not running, no lockfile
>
> From fenced.log:
> Jul 11 13:17:54 fenced fenced 3.0.12.1 started
> Jul 11 13:17:55 fenced cpg_join fenced:daemon ...
>
> And then the only errors/warnings I see in corosync.log:
> Jul 11 13:17:54 corosync [CMAN ] daemon: About to process command
> Jul 11 13:17:54 corosync [CMAN ] memb: command to process is 90
> Jul 11 13:17:54 corosync [CMAN ] memb: command return code is 0
> Jul 11 13:17:54 corosync [CMAN ] daemon: Returning command data. length = 440
> Jul 11 13:17:54 corosync [CMAN ] daemon: sending reply 40000090 to fd 18
> Jul 11 13:17:54 corosync [CMAN ] daemon: read 0 bytes from fd 18
> Jul 11 13:17:54 corosync [CMAN ] daemon: Freed 0 queued messages
> Jul 11 13:17:54 corosync [TOTEM ] Received ringid(10.128.32.22:28272) seq 61
> Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
> Jul 11 13:17:54 corosync [TOTEM ] Delivering 2 to 61
> Jul 11 13:17:54 corosync [TOTEM ] FAILED TO RECEIVE
> Jul 11 13:17:54 corosync [TOTEM ] entering GATHER state from 6.
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] lib_init_fn: conn=0xd78100
> Jul 11 13:17:54 corosync [CONFDB] exit_fn for conn=0xd78100
> Jul 11 13:17:54 corosync [CMAN ] daemon: read 20 bytes from fd 18
>
> <snip>
> Jul 11 13:17:59 corosync [CMAN ] daemon: About to process command
> Jul 11 13:17:59 corosync [CMAN ] memb: command to process is 90
> Jul 11 13:17:59 corosync [CMAN ] memb: cmd_get_node failed: id=0, name='<CC>^?'
> Jul 11 13:17:59 corosync [CMAN ] memb: command return code is -2
> Jul 11 13:17:59 corosync [CMAN ] daemon: Returning command data. length = 0
> Jul 11 13:17:59 corosync [CMAN ] daemon: sending reply 40000090 to fd 23
> Jul 11 13:17:59 corosync [CMAN ] daemon: read 0 bytes from fd 23
> Jul 11 13:17:59 corosync [CMAN ] daemon: Freed 0 queued messages
> Jul 11 13:17:59 corosync [CMAN ] daemon: read 20 bytes from fd 23
> Jul 11 13:17:59 corosync [CMAN ] daemon: client command is 5
> Jul 11 13:17:59 corosync [CMAN ] daemon: About to process command
> Jul 11 13:17:59 corosync [CMAN ] memb: command to process is 5
> Jul 11 13:17:59 corosync [CMAN ] daemon: Returning command data. length = 0
> Jul 11 13:17:59 corosync [CMAN ] daemon: sending reply 40000005 to fd 23
> <snip>
>
> Back in fenced.log:
> Jul 11 13:18:05 fenced daemon cpg_join error retrying
> Jul 11 13:18:15 fenced daemon cpg_join error retrying
> Jul 11 13:18:21 fenced daemon cpg_join error 2
> Jul 11 13:18:23 fenced cpg_leave fenced:daemon ...
> Jul 11 13:18:23 fenced daemon cpg_leave error 9
>
> And in /var/log/messages:
> Jul 11 13:17:50 server3 corosync[31116]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jul 11 13:17:50 server3 corosync[31116]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Jul 11 13:17:50 server3 corosync[31116]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[1]: 3
> Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[1]: 3
> Jul 11 13:17:50 server3 ntpd[1747]: synchronized to 10.135.136.17, stratum 1
> Jul 11 13:17:50 server3 corosync[31116]: [CPG ] chosen downlist: sender r(0) ip(10.130.32.32) ; members(old:0 left:0)
> Jul 11 13:17:50 server3 corosync[31116]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 11 13:17:50 server3 corosync[31116]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jul 11 13:17:50 server3 corosync[31116]: [CMAN ] quorum regained, resuming activity
> Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] This node is within the primary component and will provide service.
> Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[2]: 2 3
> Jul 11 13:17:50 server3 corosync[31116]: [QUORUM] Members[2]: 2 3
> Jul 11 13:17:54 server3 corosync[31116]: [TOTEM ] FAILED TO RECEIVE
> Jul 11 13:17:54 server3 fenced[31174]: fenced 3.0.12.1 started
> Jul 11 13:17:55 server3 dlm_controld[31192]: dlm_controld 3.0.12.1 started
> Jul 11 13:18:05 server3 dlm_controld[31192]: daemon cpg_join error retrying
> Jul 11 13:18:05 server3 fenced[31174]: daemon cpg_join error retrying
> Jul 11 13:18:05 server3 gfs_controld[31264]: gfs_controld 3.0.12.1 started
> Jul 11 13:18:15 server3 dlm_controld[31192]: daemon cpg_join error retrying
> Jul 11 13:18:15 server3 fenced[31174]: daemon cpg_join error retrying
> Jul 11 13:18:15 server3 gfs_controld[31264]: daemon cpg_join error retrying
> Jul 11 13:18:19 server3 abrtd: Directory 'ccpp-2012-07-11-13:18:18-31116' creation detected
> Jul 11 13:18:19 server3 abrt[31313]: Saved core dump of pid 31116 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2012-07-11-13:18:18-31116 (47955968
> Jul 11 13:18:21 server3 dlm_controld[31192]: daemon cpg_join error 2
> Jul 11 13:18:21 server3 gfs_controld[31264]: daemon cpg_join error 2
> Jul 11 13:18:21 server3 fenced[31174]: daemon cpg_join error 2
> Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 3
> Jul 11 13:18:23 server3 kernel: dlm: closing connection to node 2
> Jul 11 13:18:23 server3 dlm_controld[31192]: daemon cpg_leave error 9
> Jul 11 13:18:23 server3 gfs_controld[31264]: daemon cpg_leave error 9
> Jul 11 13:18:23 server3 fenced[31174]: daemon cpg_leave error 9
> Jul 11 13:18:30 server3 abrtd: Sending an email...
> Jul 11 13:18:30 server3 abrtd: Email was sent to: root@localhost
> Jul 11 13:18:30 server3 abrtd: Duplicate: UUID
> Jul 11 13:18:30 server3 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
> Jul 11 13:18:30 server3 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2012-07-06-10:30:40-22107
> Jul 11 13:18:30 server3 abrtd: Deleting problem directory ccpp-2012-07-11-13:18:18-31116 (dup of ccpp-2012-07-06-10:30:40-22107)
>
> Any tips much appreciated.
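For anyone who hits this later: a two-node cman cluster runs in a special
quorum mode (two_node="1" in cluster.conf), and as far as I understand that
flag cannot be changed while the cluster is running, which would explain why
fenced and dlm_controld just loop on cpg_join errors. A minimal sketch of the
reconfiguration, assuming the stock RHEL 6 two-node settings; the cluster
name and config_version values here are placeholders, not from my actual
config:

    <!-- /etc/cluster/cluster.conf, before: two-node special mode -->
    <cluster name="mycluster" config_version="5">
      <cman two_node="1" expected_votes="1"/>
      ...
    </cluster>

    <!-- after: normal quorum, one vote per node; bump config_version -->
    <cluster name="mycluster" config_version="6">
      <cman expected_votes="3"/>
      ...
    </cluster>

Because two_node cannot be toggled on a live cluster, copy the updated file
to every node and then restart the whole stack, along these lines:

    scp /etc/cluster/cluster.conf server2:/etc/cluster/
    scp /etc/cluster/cluster.conf server3:/etc/cluster/
    service cman stop     # run on ALL nodes first
    service cman start    # then start on each node

After the full restart the third node's fenced should be able to join the
fence domain.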
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster