Hi,

On Thu, 2 Jun 2011 08:37:07 -0700 (PDT), Srija <swap_project@xxxxxxxxx> wrote:
> Thank you so much for your reply again.
>
> --- On Tue, 5/31/11, Kaloyan Kovachev <kkovachev@xxxxxxxxx> wrote:
>
>> If it is a switch restart you will have the interface going down/up in
>> your logs, but more problematic is to find a short drop of the multicast
>
> I checked all the nodes and did not find anything about the interfaces
> going down/up, but all the nodes report that server19 (node 12) /
> server18 (node 11) are the problematic ones. Here are the logs from
> three nodes (out of the 16):
>
> May 24 18:04:59 server7 openais[6113]: [TOTEM] entering GATHER state from 12.
> May 24 18:05:01 server7 crond[5068]: (root) CMD (/opt/hp/hp-health/bin/check-for-restart-requests)
> May 24 18:05:19 server7 openais[6113]: [TOTEM] entering GATHER state from 11.
>
> May 24 18:04:59 server1 openais[6148]: [TOTEM] entering GATHER state from 12.
> May 24 18:05:01 server1 crond[2275]: (root) CMD (/opt/hp/hp-health/bin/check-for-restart-requests)
> May 24 18:05:19 server1 openais[6148]: [TOTEM] entering GATHER state from 11.
>
> May 24 18:04:59 server8 openais[6279]: [TOTEM] entering GATHER state from 12.
> May 24 18:05:01 server8 crond[11125]: (root) CMD (/opt/hp/hp-health/bin/check-for-restart-requests)
> May 24 18:05:19 server8 openais[6279]: [TOTEM] entering GATHER state from 11.
>
> Here are some lines from node 12 (server19) at the same time
> ___________________________________________________
>
> May 24 18:04:59 server19 openais[5950]: [TOTEM] The token was lost in the OPERATIONAL state.
> May 24 18:04:59 server19 openais[5950]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
> May 24 18:04:59 server19 openais[5950]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> May 24 18:04:59 server19 openais[5950]: [TOTEM] entering GATHER state from 2.
> May 24 18:05:19 server19 openais[5950]: [TOTEM] entering GATHER state from 11.
> May 24 18:05:20 server19 openais[5950]: [TOTEM] Saving state aru 39a8f high seq received 39a8f
> May 24 18:05:20 server19 openais[5950]: [TOTEM] Storing new sequence id for ring 2af0
> May 24 18:05:20 server19 openais[5950]: [TOTEM] entering COMMIT state.
> May 24 18:05:20 server19 openais[5950]: [TOTEM] entering RECOVERY state.
>
> And a few lines from node 11, i.e. server18
> ------------------------------------------
>
> May 24 18:04:48 server18
> May 24 18:10:14 server18 syslog-ng[5619]: syslog-ng starting up; version='2.0.10'
> May 24 18:10:14 server18 Bootdata ok (command line is ro root=/dev/vgroot_xen/lvroot rhgb quiet)
>
> So it seems that node 11 rebooted just a few minutes after we see all the
> problems in the logs of all the nodes.
>
>> You may ask the network people to check for STP changes and double check
>> the multicast configuration, and you may also try to use broadcast
>> instead of multicast or use a dedicated switch.
>
> As for the dedicated switch, the network team says it is not possible. I
> asked about the STP changes; their answer is:
>
> "there are no STP changes for the private network as there are no
> redundant devices in the environment. The multicast config is IGMP
> snooping with PIM"
>
> I have talked to the network team about using broadcast instead of
> multicast; they say they can set it up.
>
> Please comment on this...
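Regarding the broadcast question: below is a rough sketch of the relevant
cluster.conf fragment. The cluster name, config_version and node entries are
placeholders only; check the exact syntax against your cman version, and
remember to bump config_version whenever the file is changed.

    <?xml version="1.0"?>
    <cluster name="examplecluster" config_version="43">
      <!-- broadcast="yes" tells cman/openais to use broadcast instead of multicast -->
      <cman broadcast="yes"/>
      <clusternodes>
        <clusternode name="server1" nodeid="1" votes="1"/>
        <!-- ... the remaining 15 nodes ... -->
      </clusternodes>
    </cluster>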
To use broadcast (if the private addresses are in the same VLAN/subnet) you
just need to set it in the cman section of cluster.conf, roughly as in the
sketch above, but I am not sure whether it can be done on a running cluster
(without stopping or breaking it).

>> ... your interface and multicast address)
>> ping -I ethX -b -L 239.x.x.x -c 1
>> and finally run this script until the cluster gets broken
>
> Yes, I have checked it and it is working fine now. I have also set up a
> cron job for this script on one node.

No need for cron: if you haven't changed the script, this will start several
processes and your network will be overloaded!!! The script was made to be
run on a console (or via screen) and it will exit _only_ when multicast is
lost (a rough sketch of that kind of watch loop is at the end of this
message).

> I have a few questions regarding the cluster configuration...
>
> - We are using CLVM in the cluster environment. As I understand it, it is
>   active-active. The environment is Xen; all the Xen hosts are in the
>   cluster and each host has its guests. We want to keep the option to
>   live-migrate the guests from one host to another.
>
> - I was looking into the Red Hat knowledge base article
>   https://access.redhat.com/kb/docs/DOC-3068. As per that document, which
>   do you think would be the better choice, CLVM or HA-LVM?
>
> Please advise.

Can't comment on this, sorry.

> Thanks and regards again.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
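As referenced above: a rough sketch of the kind of console watch loop
described, i.e. one that keeps probing the cluster multicast group and exits
only when a probe gets no reply, so the failure time can be matched against
the openais/TOTEM entries in the logs. This is not the original script from
earlier in the thread; eth1 is a placeholder for the private/cluster
interface, and 239.x.x.x must be replaced with the real multicast address
from cluster.conf.

    #!/bin/bash
    # Placeholders: substitute the cluster (totem) interface and multicast group
    IFACE=eth1
    MCAST=239.x.x.x

    # Probe the multicast group once per second; stop as soon as a probe gets
    # no reply within 2 seconds and print the time of the failure, so it can
    # be correlated with the openais/TOTEM log entries on the nodes.
    while ping -I "$IFACE" -b -L "$MCAST" -c 1 -w 2 > /dev/null 2>&1; do
        sleep 1
    done
    echo "$(date): multicast to $MCAST via $IFACE lost"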