Re: Cluster environment issue


 



On Mon, 30 May 2011 18:22:00 -0700 (PDT), Srija <swap_project@xxxxxxxxx>
wrote:
> Thanks for your quick reply.
> 
> I talked to the network people, but they are saying everything is good
> at their end. Is there any way, at the server end, to check for a
> switch restart or a multicast traffic drop?
> 

If it is a switch restart, you will see in your logs the interface going
down/up. Harder to catch is a short drop of the multicast traffic (even a
ping script may miss it), which is the more likely case here: your cluster
works fine, then suddenly loses connection to all nodes at the same time.
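To check the logs for a link flap, something along these lines may help (the exact message text depends on the NIC driver, so treat the pattern as a guess and adjust it):

```shell
# Search the logs for interface link up/down events; the wording varies
# by driver (e.g. "NIC Link is Down" for e1000/bnx2), so adjust the
# pattern. "|| true" keeps the exit status clean when nothing matches.
grep -ihE 'link is (up|down)' /var/log/messages* 2>/dev/null || true
```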
You may ask the network people to check for STP topology changes and to
double-check the multicast configuration; you may also try broadcast
instead of multicast, or use a dedicated switch.
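If you decide to try broadcast, cman can be switched over from cluster.conf - something like the fragment below should do on RHEL 5, but I have not tested it on your version, so double-check it first:

```xml
<!-- tell cman to use UDP broadcast instead of multicast;
     all nodes must then be on the same subnet -->
<cman broadcast="yes"/>
```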

> I think you have already checked the cluster.conf file. Apart from the
> quorum disk, do you think that the cluster configuration is sufficient
> for handling the sixteen-node cluster?
> 

The config is OK ... probably add a specific multicast address in the cman
section to avoid surprises, but the default is also fine.
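For the specific multicast address, something like this in cluster.conf should work (the 239.x.x.x is a placeholder - pick an address from your own administratively scoped range and keep it consistent across all nodes):

```xml
<cman>
        <multicast addr="239.x.x.x"/>
</cman>
```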

To confirm it is a multicast drop (if you are lucky enough not to miss
it), enable ICMP echo broadcasts on just one of the nodes:
	echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
then ping from another node and check that only that single node replies
(substitute your interface and multicast address):
	ping -I ethX -b -L 239.x.x.x -c 1
and finally run this script until the cluster gets broken:

	while [ `ping -I ethX -w 1 -b -L 239.x.x.x -c 1 | grep -c ' 0% packet loss'` -eq 1 ]; do sleep 1; done; echo "missed ping at `date`"

If you get the 'missed ping' at the same time the cluster goes down - it is
confirmed :)
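For what it's worth, the loop keys on ping's summary line, and you can sanity-check the grep pattern against canned output without touching the network:

```shell
# Feed a canned ping summary line through the same grep the loop uses;
# counts lines matching the "0% packet loss" pattern.
sample="1 packets transmitted, 1 received, 0% packet loss, time 0ms"
echo "$sample" | grep -c ' 0% packet loss'
# prints 1
```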

> thanks again.
> regards
> 
> --- On Mon, 5/30/11, Kaloyan Kovachev <kkovachev@xxxxxxxxx> wrote:
> 
>> From: Kaloyan Kovachev <kkovachev@xxxxxxxxx>
>> Subject: Re:  Cluster environment issue
>> To: "linux clustering" <linux-cluster@xxxxxxxxxx>
>> Date: Monday, May 30, 2011, 4:05 PM
>> Hi,
>> when your cluster gets broken, the most likely reason is a network
>> problem (a switch restart, or multicast traffic lost for a while) on
>> the interface where the serverX-priv IPs are configured. Having a
>> quorum disk may help by giving a quorum vote to one of the servers, so
>> it can fence the others, but the best thing to do is to fix your
>> network and preferably add a redundant link for the cluster
>> communication, to avoid the breakage in the first place.
>> 
>> On Mon, 30 May 2011 12:17:07 -0700 (PDT), Srija <swap_project@xxxxxxxxx>
>> wrote:
>> > Hi,
>> > 
>> > I am very new to the Red Hat cluster and need some help and
>> > suggestions for the cluster configuration.
>> > We have a sixteen-node cluster running
>> > 
>> >     OS     : Linux Server release 5.5 (Tikanga)
>> >     kernel : 2.6.18-194.3.1.el5xen
>> > 
>> > The problem is that sometimes the cluster gets broken. The only
>> > solution so far is to reboot all sixteen nodes; otherwise the nodes
>> > do not rejoin.
>> > 
>> > We are using clvm and no quorum disk; quorum is the default.
>> > 
>> > When it gets broken, clustat shows everything offline except the
>> > node from where the clustat command was executed. If we execute the
>> > vgs or lvs commands, they hang.
>> > 
>> > Here is the current clustat report
>> > -------------------------------------
>> > 
>> > [server1]# clustat
>> > Cluster Status for newcluster @ Mon May 30 14:55:10 2011
>> > Member Status: Quorate
>> > 
>> >  Member Name                        ID   Status
>> >  ------ ----                        ---- ------
>> >  server1                               1 Online
>> >  server2                               2 Online, Local
>> >  server3                               3 Online
>> >  server4                               4 Online
>> >  server5                               5 Online
>> >  server6                               6 Online
>> >  server7                               7 Online
>> >  server8                               8 Online
>> >  server9                               9 Online
>> >  server10                             10 Online
>> >  server11                             11 Online
>> >  server12                             12 Online
>> >  server13                             13 Online
>> >  server14                             14 Online
>> >  server15                             15 Online
>> >  server16                             16 Online
>> > 
>> > Here is the cman_tool status output from one server
>> > ---------------------------------------------------
>> > 
>> > [server1 ~]# cman_tool status
>> > Version: 6.2.0
>> > Config Version: 23
>> > Cluster Name: newcluster
>> > Cluster Id: 53322
>> > Cluster Member: Yes
>> > Cluster Generation: 11432
>> > Membership state: Cluster-Member
>> > Nodes: 16
>> > Expected votes: 16
>> > Total votes: 16
>> > Quorum: 9  
>> > Active subsystems: 8
>> > Flags: Dirty 
>> > Ports Bound: 0 11  
>> > Node name: server1
>> > Node ID: 1
>> > Multicast addresses: xxx.xxx.xxx.xx 
>> > Node addresses: 192.168.xxx.xx 
>> > 
>> > 
>> > Here is the cluster.conf file.
>> > ------------------------------
>> > 
>> > <?xml version="1.0"?>
>> > <cluster alias="newcluster" config_version="23" name="newcluster">
>> > <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="15"/>
>> > 
>> > <clusternodes>
>> > 
>> > <clusternode name="server1-priv" nodeid="1" votes="1">
>> >         <fence><method name="1">
>> >         <device name="ilo-server1r"/></method>
>> >         </fence>
>> > </clusternode>
>> > 
>> > <clusternode name="server2-priv" nodeid="3" votes="1">
>> >         <fence><method name="1">
>> >         <device name="ilo-server2r"/></method>
>> >         </fence>
>> > </clusternode>
>> > 
>> > <clusternode name="server3-priv" nodeid="2" votes="1">
>> >         <fence><method name="1">
>> >         <device name="ilo-server3r"/></method>
>> >         </fence>
>> > </clusternode>
>> > 
>> > [ ... snip ... ]
>> > 
>> > <clusternode name="server16-priv" nodeid="16" votes="1">
>> >         <fence><method name="1">
>> >         <device name="ilo-server16r"/></method>
>> >         </fence>
>> > </clusternode>
>> > 
>> > </clusternodes>
>> > <cman/>
>> > 
>> > <dlm plock_ownership="1" plock_rate_limit="0"/>
>> > <gfs_controld plock_rate_limit="0"/>
>> > 
>> > <fencedevices>
>> >         <fencedevice agent="fence_ilo" hostname="server1r" login="Admin"
>> >                 name="ilo-server1r" passwd="xxxxx"/>
>> >         ..........
>> >         <fencedevice agent="fence_ilo" hostname="server16r" login="Admin"
>> >                 name="ilo-server16r" passwd="xxxxx"/>
>> > </fencedevices>
>> > <rm>
>> > <failoverdomains/>
>> > <resources/>
>> > </rm></cluster>
>> > 
>> > Here is the lvm.conf file
>> > --------------------------
>> > 
>> > devices {
>> > 
>> >     dir = "/dev"
>> >     scan = [ "/dev" ]
>> >     preferred_names = [ ]
>> >     filter = [ "r/scsi.*/","r/pci.*/","r/sd.*/","a/.*/" ]
>> >     cache_dir = "/etc/lvm/cache"
>> >     cache_file_prefix = ""
>> >     write_cache_state = 1
>> >     sysfs_scan = 1
>> >     md_component_detection = 1
>> >     md_chunk_alignment = 1
>> >     data_alignment_detection = 1
>> >     data_alignment = 0
>> >     data_alignment_offset_detection = 1
>> >     ignore_suspended_devices = 0
>> > }
>> > 
>> > log {
>> > 
>> >     verbose = 0
>> >     syslog = 1
>> >     overwrite = 0
>> >     level = 0
>> >     indent = 1
>> >     command_names = 0
>> >     prefix = "  "
>> > }
>> > 
>> > backup {
>> > 
>> >     backup = 1
>> >     backup_dir = "/etc/lvm/backup"
>> >     archive = 1
>> >     archive_dir = "/etc/lvm/archive"
>> >     retain_min = 10
>> >     retain_days = 30
>> > }
>> > 
>> > shell {
>> > 
>> >     history_size = 100
>> > }
>> > global {
>> >     library_dir = "/usr/lib64"
>> >     umask = 077
>> >     test = 0
>> >     units = "h"
>> >     si_unit_consistency = 0
>> >     activation = 1
>> >     proc = "/proc"
>> >     locking_type = 3
>> >     wait_for_locks = 1
>> >     fallback_to_clustered_locking = 1
>> >     fallback_to_local_locking = 1
>> >     locking_dir = "/var/lock/lvm"
>> >     prioritise_write_locks = 1
>> > }
>> > 
>> > activation {
>> >     udev_sync = 1
>> >     missing_stripe_filler = "error"
>> >     reserved_stack = 256
>> >     reserved_memory = 8192
>> >     process_priority = -18
>> >     mirror_region_size = 512
>> >     readahead = "auto"
>> >     mirror_log_fault_policy = "allocate"
>> >     mirror_image_fault_policy = "remove"
>> > }
>> > dmeventd {
>> > 
>> >     mirror_library = "libdevmapper-event-lvm2mirror.so"
>> >     snapshot_library = "libdevmapper-event-lvm2snapshot.so"
>> > }
>> > 
>> > 
>> > If you need more information, I can provide it.
>> > 
>> > Thanks for your help
>> > Priya
>> > 
>> > --
>> > Linux-cluster mailing list
>> > Linux-cluster@xxxxxxxxxx
>> > https://www.redhat.com/mailman/listinfo/linux-cluster
>> 
>> 
> 


