shr289.cup.hp.com resolves to 16.89.116.32 and shr295.cup.hp.com resolves to 16.89.112.182.

I would assume that our switches support multicast, since we have another RHEL 6.2 cluster that runs fine through the same switch. I'll also put the fencing back into the cluster configuration and try again.
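Rather than assuming, the multicast path itself can be checked. A quick sketch, assuming the omping utility (packaged for RHEL/CentOS 6) is installed; run it on both nodes at the same time:

    [root@shr289 ~]# omping shr289.cup.hp.com shr295.cup.hp.com
    [root@shr295 ~]# omping shr289.cup.hp.com shr295.cup.hp.com

If the unicast replies keep flowing but the multicast replies stop after a few minutes, the switch is tearing down the group (typically IGMP snooping with no IGMP querier on the VLAN), which is exactly the failure mode Digimer describes below.

For the fencing, a minimal cluster.conf fragment in the style of the tutorial linked below might look like the following; the iLO addresses, login, and password here are placeholders, not the real values:

    <clusternode name="shr289.cup.hp.com" nodeid="1">
      <fence>
        <method name="ilo">
          <device name="ilo_shr289" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="shr295.cup.hp.com" nodeid="2">
      <fence>
        <method name="ilo">
          <device name="ilo_shr295" action="reboot"/>
        </method>
      </fence>
    </clusternode>
    ...
    <fencedevices>
      <fencedevice name="ilo_shr289" agent="fence_ilo" ipaddr="ilo-shr289.example.com" login="admin" passwd="secret"/>
      <fencedevice name="ilo_shr295" agent="fence_ilo" ipaddr="ilo-shr295.example.com" login="admin" passwd="secret"/>
    </fencedevices>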
Thanks
Ming

-----Original Message-----
From: Digimer [mailto:lists@xxxxxxxxxx]
Sent: Friday, June 01, 2012 11:44 AM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: Help needed

What do 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does your switch support multicast properly? If the switch periodically tears down a multicast group, your cluster will partition.

You *must* have fencing configured. Fencing using iLO works fine, please use it. See https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO

Without fencing, your cluster will be unstable.

Digimer

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. The cluster configuration file and network configuration are below. Also, one piece of bad news: the original issue has come back again.
> So I've seen two problems, and both show up sporadically.
> Thanks again for your help.
> Regards
> Ming
>
> 1. The original one. I increased the version number and the error was gone for a while, but it has come back. Do you know why?
>
>>> May 31 09:08:05 shr295 corosync[3542]: [MAIN ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
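The "Node 1 conflict, remote config version id=4, local=2" line above explains the error: the running cluster is already at configuration version 4 while the file on shr295 still says config_version="2", so every reload from that file is rejected. A sketch of the usual recovery on RHEL/CentOS 6, run on the node with the stale file:

    [root@shr295 ~]# cman_tool version             # shows the version the running cluster is using
    [root@shr295 ~]# vi /etc/cluster/cluster.conf  # set config_version one higher than that
    [root@shr295 ~]# ccs_config_validate           # sanity-check the edited file
    [root@shr295 ~]# cman_tool version -r          # activate the new version

Then make sure both nodes end up with the same file.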
> 2. The new one:
>
> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
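"daemon cpg_join error retrying" from fenced, dlm_controld, and gfs_controld usually means those daemons cannot join their corosync process groups. A firewall silently dropping cluster traffic between the nodes is a common culprit; as a sketch, using the subnet from the ifconfig output below (16.89.112.0/21, which covers both nodes) and the standard RHEL 6 cluster ports:

    iptables -I INPUT -s 16.89.112.0/21 -p udp --dport 5404:5405 -j ACCEPT   # corosync/cman
    iptables -I INPUT -s 16.89.112.0/21 -p tcp --dport 21064 -j ACCEPT       # dlm
    iptables -I INPUT -s 16.89.112.0/21 -p tcp --dport 11111 -j ACCEPT       # ricci
    service iptables save

A quicker first test is simply "service iptables stop" on both nodes; if the cpg_join errors stop, rules along these lines are what's missing.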
> Cluster configuration file:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>   <logging debug="on"/>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <clusternodes>
>>>     <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>   </fencedevices>
>>>   <rm>
>>>   </rm>
>>> </cluster>
>
> I had a fencing configuration there, but I'd like to see that I can bring up a simple cluster first; then I will add the fencing back.
>
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
>
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
>
>
> -----Original Message-----
> From: Digimer [mailto:lists@xxxxxxxxxx]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: Help needed
>
> Please send your cluster.conf, editing out only the passwords. Please also
> include your network configs.
>
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists@xxxxxxxxxx]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>> Hi, I have the following simple cluster config, just to try it out on CentOS 6.2:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>   <logging debug="on"/>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <clusternodes>
>>>     <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>       <fence>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>   </fencedevices>
>>>   <rm>
>>>   </rm>
>>> </cluster>
>>>
>>> And I got the following error messages when I did "service cman start"; I got the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]: [MAIN ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]: [CMAN ] Activity suspended on this node
>>
>> Run 'cman_tool version' to get the current version of the configuration, then increase config_version="x" to be one higher.
>>
>> Also, configure fencing! If you don't, your cluster will hang the first time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com

--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster