What do 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to?

Does your switch support multicast properly? If the switch periodically
tears down a multicast group, your cluster will partition.

You *must* have fencing configured. Fencing using iLO works fine; please
use it. See:

https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO

Without fencing, your cluster will be unstable.

Digimer

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. The cluster configuration file and network configuration are below. Also, some bad news: the original issues have come back again.
> So I've seen two problems, and both come sporadically:
> Thanks again for your help.
> Regards
> Ming
>
> 1. The original one. I've increased the version number, and the problem was gone for a while, but it has come back. Do you know why?
>
>>> May 31 09:08:05 shr295 corosync[3542]: [MAIN ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
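The usual way out of this conflict is the procedure described at the
bottom of this thread: raise config_version past the highest version
running on either node, then reload. A minimal sketch, assuming the
stock /etc/cluster/cluster.conf path on CentOS/RHEL 6 (and note that on
6.x, pushing the file to the other node with 'cman_tool version -r'
also expects ricci to be running on both nodes):

# Show the config version corosync is actually running (on each node):
cman_tool version

# Edit /etc/cluster/cluster.conf, raise config_version="..." past the
# highest version reported (here, past 4), then sanity-check the file:
ccs_config_validate

# Load and propagate the new configuration into the running cluster:
cman_tool version -r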
>
> 2.
> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
> Cluster configuration file:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>   <logging debug="on"/>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <clusternodes>
>>>     <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>   </fencedevices>
>>>   <rm>
>>>   </rm>
>>> </cluster>
>
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first; then I will add the fencing back.
>
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
>
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
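For reference, filling in the empty <fence> and <fencedevices> sections
above with HP iLO, along the lines of the tutorial linked at the top of
this thread, might look something like the sketch below. The iLO
hostnames, login, and password are placeholders only, not values from
this thread, and config_version has to be bumped past whatever is
already running when the change is made:

<?xml version="1.0"?>
<!-- config_version must be higher than any version already running -->
<cluster config_version="5" name="vmcluster">
  <logging debug="on"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="shr289.cup.hp.com" nodeid="1">
      <fence>
        <method name="ilo">
          <!-- 'name' must match a <fencedevice> entry below -->
          <device name="ilo_shr289" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="shr295.cup.hp.com" nodeid="2">
      <fence>
        <method name="ilo">
          <device name="ilo_shr295" action="reboot"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- ipaddr/login/passwd are placeholders; point each at that node's iLO -->
    <fencedevice agent="fence_ilo" name="ilo_shr289" ipaddr="shr289-ilo.cup.hp.com" login="admin" passwd="secret"/>
    <fencedevice agent="fence_ilo" name="ilo_shr295" ipaddr="shr295-ilo.cup.hp.com" login="admin" passwd="secret"/>
  </fencedevices>
  <rm>
  </rm>
</cluster>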
> -----Original Message-----
> From: Digimer [mailto:lists@xxxxxxxxxx]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: Help needed
>
> Send your cluster.conf please, editing out only the passwords. Please also
> include your network configs.
>
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists@xxxxxxxxxx]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>> Hi, I have the following simple cluster config just to try out on CentOS 6.2:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>   <logging debug="on"/>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <clusternodes>
>>>     <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>   </fencedevices>
>>>   <rm>
>>>   </rm>
>>> </cluster>
>>>
>>> And I got the following error messages when I ran "service cman start". I got the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]: [MAIN ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
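On the multicast question raised at the top of the thread: a cpg_join
retry loop like the one quoted earlier usually means corosync traffic
is not passing between the nodes at all. A quick check, assuming the
omping package (available for CentOS 6) is installed, is to run this on
both nodes at the same time:

omping shr289.cup.hp.com shr295.cup.hp.com

# Each node should report unicast and multicast replies from the other.
# If multicast replies never arrive, look at IGMP snooping on the switch
# and make sure the firewall is not dropping corosync's UDP ports
# (5404-5405 by default):
iptables -L -n | grep -E '5404|5405'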
>>>
>>
>> Run 'cman_tool version' to get the current version of the configuration,
>> then increase the config_version="x" to be one higher.
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com

--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster