Thank you so much for your reply again.

--- On Tue, 5/31/11, Kaloyan Kovachev <kkovachev@xxxxxxxxx> wrote:

> If it is a switch restart you will have in your logs the interface going
> down/up, but more problematic is to find a short drop of the multicast

I checked all the nodes and did not find anything about the interface, but every node reports that server19 (node 12) / server18 (node 11) is the problematic one. Here are the logs from three nodes (out of 16):

May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.
May 24 18:05:01 server7 crond[5068]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11.

May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12.
May 24 18:05:01 server1 crond[2275]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state from 11.

May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state from 12.
May 24 18:05:01 server8 crond[11125]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state from 11.

Here are some lines from node 12 (server19) at the same time:
___________________________________________________
May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state.
May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2.
May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11.
May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f high seq received 39a8f
May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id for ring 2af0
May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state.
May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state.

Here are a few lines from node 11, i.e. server18:
------------------------------------------
May 24 18:04:48 server18
May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; version='2.0.10'
May 24 18:10:14 server18 Bootdata ok (command line is ro root=/dev/vgroot_xen/lvroot rhgb quiet)

So it seems that node 11 rebooted just a few minutes after the problems appeared in the logs of all the nodes.

> You may ask the network people to check for STP changes and double check
> the multicast configuration and you may also try to use broadcast instead
> of multicast or use a dedicated switch.

As for the dedicated switch, I don't think it is possible, according to the network team. I asked them about STP changes. Their answer is: "there are no stp changes for the private network as there are no redundant devices in the environment. the multicast configs is igmp snooping with Pim"

I have talked to the network team about using broadcast instead of multicast; according to them, they can set it up. Please comment on this.

> your interface and multicast address)
> ping -I ethX -b -L 239.x.x.x -c 1
> and finally run this script until the cluster gets broken

Yes, I have checked it, and it is working fine now. I have also set up a cron job for this script on one node.

I have a few questions regarding the cluster configuration:

- We are using CLVM in the cluster environment. As I understand it, it is active-active. The environment is Xen. All the Xen hosts are in the cluster and each host has guests. We are keeping the option to live-migrate the guests from one host to another.
- I was looking at the Red Hat knowledgebase article at https://access.redhat.com/kb/docs/DOC-3068 and, based on that document, which do you think would be the best choice, CLVM or HA-LVM? Please advise.

Thanks and regards again.
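
P.S. For comparing the [TOTEM] lines quoted above across all 16 nodes, here is a minimal Python sketch that groups hosts by timestamp and by the number printed after "from" in the GATHER messages. The function name `gather_events` and the regular expression are only illustrative assumptions based on the log format shown in this mail; they are not part of openais.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the excerpts in this mail, e.g.:
#   May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.
GATHER_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"openais\[\d+\]:\s+\[TOTEM\]\s+entering GATHER state from (?P<num>\d+)\."
)


def gather_events(lines):
    """Map (timestamp, number after 'from') to the hosts that logged it."""
    events = defaultdict(list)
    for line in lines:
        m = GATHER_RE.match(line.strip())
        if m:
            key = (m.group("stamp"), int(m.group("num")))
            events[key].append(m.group("host"))
    return dict(events)


# A few of the lines quoted above, used as sample input.
sample = [
    "May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.",
    "May 24 18:05:01 server7 crond[5068]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)",
    "May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11.",
    "May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12.",
    "May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2.",
]

for (stamp, num), hosts in sorted(gather_events(sample).items()):
    print(stamp, "from", num, "->", ", ".join(hosts))
```

Feeding it the concatenated /var/log/messages excerpts from every node should make it easier to see which hosts logged the same transition at the same second.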