[UPDATE] IP monitor failing periodically

Chris Harms <chris@xxxxxxxxxxx> · Sat, 21 Jul 2007 15:41:04 -0500

We reinstalled our machines with RHEL 5 x86_64 (we were running i386) a 
few weeks ago and the mysterious IP monitoring failures have disappeared. 
I believe it was postulated that a compiler bug regarding -fpie might be 
causing segfaults in i386 binaries, so this would support that theory to 
some degree, although I did not really attempt to confirm it further.  I 
thought the architecture change fixing the random failovers was noteworthy.

### previous thread below

Hi Chris,

I am experiencing the same problem on RHEL 5 and I have a support 
request open with Red Hat.

I was asked to increase the debug level by changing the <rm> line in the 
cluster configuration to:

<rm log_facility="local4" log_level="7">

I then needed to add "local4.* /var/log/cluster" to /etc/syslog.conf and 
run "service syslog restart".

To apply the change I then needed to propagate the updated cluster 
configuration to both nodes:

# ccs_tool update /etc/cluster/cluster.conf
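
Before propagating, the logging settings described above can be read back
from the file as a sanity check. The following is only a minimal Python
sketch under stated assumptions: the sample configuration, the cluster
name "example" and the rm_logging helper are hypothetical, and it assumes
<rm> is a direct child of <cluster>. It also reports config_version,
which is generally expected to be incremented before ccs_tool update
accepts a changed file; verify that against your own release.

```python
import xml.etree.ElementTree as ET


def rm_logging(conf_xml):
    """Return (config_version, log_facility, log_level) from cluster.conf text."""
    root = ET.fromstring(conf_xml)
    rm = root.find("rm")  # assumes <rm> sits directly under <cluster>
    if rm is None:
        raise ValueError("no <rm> element found")
    return (root.get("config_version"), rm.get("log_facility"), rm.get("log_level"))


# Hypothetical minimal configuration, for illustration only.
SAMPLE = """<cluster name="example" config_version="3">
  <rm log_facility="local4" log_level="7"/>
</cluster>"""

if __name__ == "__main__":
    print(rm_logging(SAMPLE))
```

To check a real node, pass the contents of /etc/cluster/cluster.conf
instead of SAMPLE.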

In the week since enabling the increased logging I have not had the 
problem once, although before that it occurred regularly, 2 to 3 times a 
day. One day last week, out of curiosity, I reverted to the default 
settings, and within a few hours I had the "Failed to ping" error on one 
of the clustered IP addresses and the service was restarted.

I now have the logging back at 7 and the support request is pending.

Regards
--
David Schroeder
Server Support
Information Services Division
Flinders University
Adelaide, Australia
Ph: +61 8 8201 2689

Chris Harms wrote:
I am experiencing periodic failovers due to a floating IP address not 
passing the status check:

clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204
Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip "192.168.13.204" returned 1 (generic error)

Both nodes have bonded NICs with gigabit connections to redundant 
switches, so it is unlikely that the links are going down, and there is 
nothing in the logs about Linux losing them.  I parked all the cluster services - 2 
Postgres services and 1 Apache - on one node and allowed it to run 
overnight.  There would be no client activity during this time. One 
Postgres service failed two times in this manner and the other failed 
once in this manner.  The Apache service did not fail.
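
One way to quantify failures like these is to tally the two message types
per IP address from the syslog output. The following is a rough Python
sketch, not a definitive tool: the regular expressions are modelled only
on the two log lines quoted above, the tally helper and SAMPLE_LINES are
hypothetical names, and it assumes the status message occupies a single
line in the actual log file.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted in this thread (assumption:
# other rgmanager versions may format these lines differently).
PING_FAIL = re.compile(r"clurgmgrd.*<warning> Failed to ping (\S+)")
STATUS_FAIL = re.compile(
    r'clurgmgrd\[\d+\]: <notice> status on ip "([^"]+)" returned (\d+)'
)


def tally(lines):
    """Count failed pings and non-zero ip status results per address."""
    pings = Counter()
    statuses = Counter()
    for line in lines:
        match = PING_FAIL.search(line)
        if match:
            pings[match.group(1)] += 1
            continue
        match = STATUS_FAIL.search(line)
        if match and match.group(2) != "0":
            statuses[match.group(1)] += 1
    return pings, statuses


# The two log lines from this thread, with the wrapped line rejoined.
SAMPLE_LINES = [
    "clurgmgrd: [9975]: <warning> Failed to ping 192.168.13.204",
    'Jun 30 11:41:47 nodeA clurgmgrd[9975]: <notice> status on ip '
    '"192.168.13.204" returned 1 (generic error)',
]

if __name__ == "__main__":
    pings, statuses = tally(SAMPLE_LINES)
    print(dict(pings), dict(statuses))
```

For a real log, pass an open file object (for example the file that
receives the rgmanager messages) instead of SAMPLE_LINES.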

What can I do to resolve this or get more information out of the system 
to resolve this?

Thanks in advance,
Chris

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
