OS: RHEL4 Update 4
Kernel: 2.6.9-42.ELsmp
Cluster: RHCS4 Update 4, RHGFS4 U4 (GFS-6.1.6-1)
Multipath: EMCpower.LINUX-4.5.1-022
Storage: Fibre channel with EMC CX-320
Fence Device: DELL DRAC5
Service: Postfix, Courier-imap
nodeA.example.com: 192.168.0.20
nodeB.example.com: 192.168.0.60
Drac5(nodeA): 192.168.0.121
Drac5(nodeB): 192.168.0.161
I have a 2-node GFS cluster using PowerPath, connected through Fibre Channel to
EMC CX-320 storage.
Both nodes use DRAC5 as the fence device.
Heartbeat traffic uses the same interface as normal traffic (mail, IMAP/POP3).
The problem is that nodeB keeps fencing nodeA, always with the reason
"Missed too many heartbeats". It is only ever nodeB that fences nodeA.
After nodeA is rebooted it can join the cluster again and works fine
until nodeB fences it again, maybe 4-5 hours or 6-7 hours later.
This happens at random, 2-3 times per day.
Memory, CPU, and I/O look good, and traffic was not at its peak when the
problem occurred (from sar and mrtg).
ifconfig shows no drops and no collisions.
The log file shows the same messages every time nodeB fences nodeA.
I tried to extend the heartbeat timeout by changing "deadnode_timeout" from 21 to
61, but it doesn't help.
Is there any way to solve this problem or enable more debugging?
Do I have to dedicate a network card to separate heartbeat and normal traffic?
###### /var/log/messages
Aug 7 21:50:06 nodeB kernel: CMAN: removing node nodeA.example.com from the cluster : Missed too many heartbeats
Aug 7 21:50:06 nodeB fenced[20770]: nodeA.example.com not a cluster member after 0 sec post_fail_delay
Aug 7 21:50:06 nodeB fenced[20770]: fencing node "nodeA.example.com"
Aug 7 21:50:15 nodeB fenced[20770]: fence "nodeA.example.com" success
Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Trying to acquire journal lock...
Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Looking at journal...
Aug 7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Done
Aug 7 21:53:36 nodeB kernel: CMAN: node nodeA.example.com rejoining
###### /etc/cluster/cluster.conf ################
<?xml version="1.0" ?>
<cluster config_version="7" name="bkkair_cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="15"/>
  <clusternodes>
    <clusternode name="nodeA.example.com" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="DRAC-nodeA"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="nodeB.example.com" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="DRAC-nodeB"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <cman deadnode_timeout="61"/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="192.168.0.121"
                 login="root" name="DRAC-nodeA"
                 passwd="supervis"/>
    <fencedevice agent="fence_drac" ipaddr="192.168.0.161"
                 login="root" name="DRAC-nodeB"
                 passwd="supervis"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
#####################################################
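A note on the two <cman> lines in the configuration above: expected_votes/two_node
and deadnode_timeout are given in separate <cman> elements. Whether cman reads the
second element is an assumption not verified here; if it does not, the 61-second
value would never take effect. A minimal sketch of the merged form, using only the
attributes already present in the file:

```xml
<!-- Sketch only: both original <cman> elements combined into one.
     Attribute names and values are copied from the config above. -->
<cman expected_votes="1" two_node="1" deadnode_timeout="61"/>
```

On this cman version the value in use should be readable from
/proc/cluster/config/cman/deadnode_timeout, which would show whether the change
from 21 to 61 was actually picked up.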
Regards,
Nattapon