Hi People.

I have 4 servers under RH Cluster Suite, in two clusters: one running Oracle, and one running two very important Java daemons for a local telecom.

Sometimes I see the following in the log files (remote syslog outside the clusters):

--------------------------------------------------------
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: IP address 10.100.1.151 missing
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: 0: error fetching interface information: Device not found
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17188]: <err> service error: Check status failed on IP addresses for tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd[17187]: <warning> Restarting locally failed service tomcat
Jul 11 11:56:40 szgtr01 last message repeated 2 times
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopping service tomcat ...
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopping service tomcat ...
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Running user script '/etc/init.d/tomcat stop'
Jul 11 11:56:40 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Running user script '/etc/init.d/tomcat stop'
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <info> service info: Stopping IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <info> service info: Stopping IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopped service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17440]: <notice> service notice: Stopped service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd[17187]: <notice> Starting stopped service tomcat
Jul 11 11:56:46 szgtr01 clusvcmgrd[17187]: <notice> Starting stopped service tomcat
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Starting service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Starting service tomcat ...
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Starting IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Starting IP address 10.100.1.151
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Sending Gratuitous arp for 10.100.1.151 (00:12:79:D6:7F:30)
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <info> service info: Sending Gratuitous arp for 10.100.1.151 (00:12:79:D6:7F:30)
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Running user script '/etc/init.d/tomcat start'
Jul 11 11:56:46 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Running user script '/etc/init.d/tomcat start'
Jul 11 11:56:58 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Started service tomcat ...
Jul 11 11:56:58 szgtr01 clusvcmgrd: [17716]: <notice> service notice: Started service tomcat ...
--------------------------------------------------------

As far as I can see, the "IP address <foo> missing" message comes from the /usr/lib/clumanager/services/svclib_ip script. Why? Nobody is removing the service address from the interface, but sometimes the script (or ifconfig) fails to find the IP address.

I didn't hack anything. Everything was configured with the redhat-config-cluster graphical tool, BTW. The cluster has 2 shared raw devices (sda1 and sdb1) from an HP SAN (EVA) for configuration. I have WTI NPS power switches ...

The IPv4 configuration of the machine is as follows:

--------------------------------------------------------
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
2: bond0: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
    link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.20/24 brd 10.100.1.255 scope global bond0
    inet 10.100.1.152/24 brd 10.100.1.255 scope global secondary bond0:0
    inet 10.100.1.151/24 brd 10.100.1.255 scope global secondary bond0:1
3: eth0: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
    link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
    inet 10.100.252.20/24 brd 10.100.252.255 scope global eth0
4: eth1: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.20/24 brd 10.100.1.255 scope global eth1
5: eth2: <BROADCAST,MULTICAST,NOARP,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:12:79:d6:7f:30 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.20/24 brd 10.100.1.255 scope global eth2
6: eth3: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond1 qlen 1000
    link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
    inet 10.100.252.20/24 brd 10.100.252.255 scope global eth3
7: bond1: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
    link/ether 00:12:79:d6:7f:31 brd ff:ff:ff:ff:ff:ff
    inet 10.100.252.20/24 brd 10.100.252.255 scope global bond1
--------------------------------------------------------

Network 10.100.1.0/24 on bond0 is the data and heartbeat link, and network 10.100.252.0/24 on bond1 is a special VLAN/network for communication with the WTI NPS network power switches.

Every machine in the cluster has 2 network cards: a gigabit Broadcom and a gigabit Intel card, with the bcm5700 and e1000 drivers. The cluster nodes run Red Hat Advanced Server 3. All machines are updated from RHN to Update 5.
The clumanager package is 1.2.26.1-1.

This is cluster.xml on the first cluster:

------------------------------------------------------------------
# strings /dev/sda1
uszgtr01.tel.local
0/usr/sbin/clusvcmgrd
0/usr/sbin/clusvcmgrd
p/usr/sbin/clusvcmgrd
v%bE
8{+No!
Qwf?
UT!u
S-R*.
@&e:
'&gs
k/wB
ix.x
<?xml version="1.0"?>
<cluconfig version="3.0">
  <clumembd broadcast="no" interval="750000" loglevel="6" multicast="yes" multicast_ipaddress="225.0.0.11" thread="yes" tko_count="20"/>
  <cluquorumd loglevel="6" pinginterval="" tiebreaker_ip="10.100.1.1"/>
  <clurmtabd loglevel="6" pollinterval="4"/>
  <clusvcmgrd loglevel="6"/>
  <clulockd loglevel="6"/>
  <cluster config_viewnumber="10" key="55c6b6814c16718ea1728bdfcea5cf78" name="java"/>
  <sharedstate driver="libsharedraw.so" rawprimary="/dev/raw/raw1" rawshadow="/dev/raw/raw2" type="raw"/>
  <members>
    <member id="0" name="szgtr01" watchdog="no">
      <powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxx" port="1" type="wti_nps" user=""/>
      <powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxx" port="1" type="wti_nps" user=""/>
    </member>
    <member id="1" name="szgtr02" watchdog="no">
      <powercontroller id="0" ipaddress="10.100.252.222" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
      <powercontroller id="1" ipaddress="10.100.252.223" password="xxxxxxxx" port="5" type="wti_nps" user=""/>
    </member>
  </members>
  <services>
    <service checkinterval="8" failoverdomain="javadom" id="0" maxfalsestarts="0" maxrestarts="0" name="tomcat" userscript="/etc/init.d/tomcat">
      <service_ipaddresses>
        <service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.151" netmask="255.255.255.0"/>
      </service_ipaddresses>
    </service>
    <service checkinterval="8" failoverdomain="javadom" id="1" maxfalsestarts="0" maxrestarts="0" name="rad" userscript="/etc/init.d/radiusd">
      <service_ipaddresses>
        <service_ipaddress broadcast="10.100.1.255" id="0" ipaddress="10.100.1.152" netmask="255.255.255.0"/>
      </service_ipaddresses>
    </service>
  </services>
  <failoverdomains>
    <failoverdomain id="0" name="javadom" ordered="yes" restricted="yes">
      <failoverdomainnode id="0" name="szgtr01"/>
      <failoverdomainnode id="1" name="szgtr02"/>
    </failoverdomain>
  </failoverdomains>
</cluconfig>
------------------------------------------------------------------------

lsmod:

------------------------------------------------------------------------
Module                  Size  Used by
iptable_filter          2412   0 (autoclean) (unused)
ip_tables              16544   1 [iptable_filter]
cpqci                  28612   3
audit                  90808   3
bonding1               25156   1
e1000                  83784   2
bcm5700               110564   2
bonding                25156   1
microcode               6912   0 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5688   0 (unused)
hid                    22532   0 (unused)
input                   6176   0 [keybdev mousedev hid]
ehci-hcd               20776   0 (unused)
usb-uhci               26860   0 (unused)
usbcore                81152   1 [hid ehci-hcd usb-uhci]
ext3                   89960   3
jbd                    55156   3 [ext3]
sg                     37324   0
qla2300               590844   9
qla2300_conf          301560   0
cciss                  45188   4
sd_mod                 14128   8
scsi_mod              115496   3 [sg qla2300 cciss sd_mod]
------------------------------------------------------------------------

I have set up this script to watch the ifconfig output every half second:

    while `usleep 500000`
    do
        ifconfig bond0:1;echo "----------------------------"
    done >> /tmp/ifconfig.log &

After 4-5 hours, I have one failure. The HWaddr line was logged 23208 times, but the service address only 23207 times, so in one sample bond0:1 was listed without its IP address:

    # grep addr:10.100.1.151 ifconfig.log | wc -l
    23207
    # grep 'HWaddr 00:12:79:D6:7F:30' ifconfig.log | wc -l
    23208

Can somebody help me with this IP network and the occasionally missing service IP address?

Thanks ...

P.S. The service failures are very random. Sometimes 2-3 in one day, sometimes only one a week ... but this is not acceptable to my customer. :-(

P.P.S. The situation (bug) is the same on all 4 cluster nodes. The hardware is HP ProLiant DL 380 with hot-swappable SCSI disks in HW RAID1, 2 CPUs each, and 12 GB RAM.

--
Miroslav Zubcic, RHCE, Nimium d.o.o., email: <mvz@xxxxxxxxx>
Tel: +385 01 4852 639, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Mrazoviceva 12, 10000 Zagreb, Hrvatska

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster