Severe problems with 64-bit RHCS on RHEL5.1

Hi all,

Short introduction: My name is Harry, I'm working in Helsinki, Finland, and have used RHCS from the beginning. We currently have 7 clusters, mainly running MySQL/Oracle databases.

I thought I had some kind of knowledge of this clustering software, and everything seemed to be ok until version 5. I don't have problems or severe bugs in any of my RHEL4 clusters.

But....

Tried to move to 5.1 with 64-bit HP blades. The cluster just won't work, or it works but I don't have any kind of trust in it anymore. I have run about 20 different scenarios and there are far too many problems; a couple of them will prevent me from using this at all. I have created 3 tickets with RH support and it seems to me that they don't even know the little that I know. Twice I have had to tell them to read the f...g manual, because they have spoken directly against the qdisk man page. They just don't know how it should work... hard to believe, but true.

First, I asked how to change cman's deadnode_timeout in 5, because /proc no longer has it and that parameter didn't work in my tests. Support said "you can't tune the timeout at all". I asked how I can use qdisk at all if the man page says cman's timeout must be greater than qdisk's eviction timeout... and told them to read the man page. Finally I found the correct parameter myself: the totem token timeout.
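For anyone hitting the same wall: on RHEL5 the membership timeout moved into the openais totem layer of cluster.conf. A minimal sketch of the fragment I mean (the values are illustrative, not tuned recommendations):

```xml
<!-- Sketch of a RHEL5 cluster.conf fragment. The totem token timeout
     (milliseconds) replaces the old /proc deadnode_timeout knob.
     Per the qdisk man page it must exceed the qdisk eviction time,
     which is interval * tko seconds (here 1 * 10 = 10 s < 21 s). -->
<totem token="21000"/>
<quorumd interval="1" tko="10" votes="1" label="myqdisk"/>
```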

The second time, they said that in my 2-node cluster I made a mistake by giving 1 vote to the quorum disk... but the man page again says to do exactly that, and of course it is correct in a 2-node cluster...
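To show the vote math as I understand it from the man page, a sketch of a 2-node + qdisk layout (node names and the device path are made-up placeholders):

```xml
<!-- Two nodes with 1 vote each plus a 1-vote quorum disk gives
     expected_votes="3", so either node together with the qdisk
     (2 of 3 votes) keeps the cluster quorate. Note two_node mode
     must be off when a quorum disk is used. -->
<cman expected_votes="3" two_node="0"/>
<clusternodes>
  <clusternode name="node1" votes="1" nodeid="1"/>
  <clusternode name="node2" votes="1" nodeid="2"/>
</clusternodes>
<quorumd votes="1" interval="1" tko="10" device="/dev/mapper/qdisk"/>
```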

So, this is my sad history with version 5. Do you use 64-bit version 5, and what is your feeling about it?

My problems this time are:

1. 2-node cluster. I can't start only one node to get the cluster services up - it hangs in fencing and waits until I start the second node, and immediately after that, when both nodes are starting cman, the cluster comes up. So if I have lost one node and have to restart the surviving node for some reason, I can't get the cluster up. It should work like before (both nodes are down, I start one, it fences the other and comes up). Now it just waits... the log says:

ccsd[25272]: Error while processing connect: Connection refused

This error message is so common that it tells me nothing...
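For comparison, in a plain two-node cluster without a quorum disk, the only mechanism I know of that lets a single node become quorate on its own is cman's two_node mode. A sketch, assuming no qdisk is configured:

```xml
<!-- With two_node="1" and expected_votes="1", a lone node is quorate,
     so it can fence the missing peer and bring services up by itself.
     This mode is mutually exclusive with a quorum disk. -->
<cman two_node="1" expected_votes="1"/>
```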

2. qdisk doesn't work. 2-node cluster. I start it (both nodes at the same time) to get it up. Works ok: qdisk works, the heuristic works, everything works. But if I stop the cluster daemons on one node, that node can't join the cluster anymore without a complete reboot. It joins, the other node says ok, the node itself says ok, the quorum device is registered and the heuristic is up, but the node's quorum disk stays offline and the other node says this node is offline. If I reboot the machine, it joins the cluster fine.

3. Funny thing: the heuristic ping didn't work at all in the beginning, and support gave me a "ping script" which made it work... which describes quite well how experimental this cluster is nowadays...
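I don't have their exact script, but the general shape of a qdisk ping heuristic looks roughly like this (the gateway address is a made-up placeholder, not their fix):

```xml
<!-- Hypothetical heuristic: the exit status of "program" drives the
     score. While the gateway answers ping, the node keeps its score;
     if it fails repeatedly, qdiskd marks the node unfit. -->
<quorumd interval="1" tko="10" votes="1" label="myqdisk">
  <heuristic program="ping -c1 -w1 192.168.0.1" score="1" interval="2"/>
</quorumd>
```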

I have to tell you it is a FACT that the basics are ok: fencing works in a normal situation, I don't have typos, the configs are in sync, everything is ok, but these problems still exist.

I have twice sent sosreports etc. to RH support. They have spent 3 weeks and still can't say what's wrong...


If anybody has anything in mind that might help...

Thanks,

-hjp


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
