On Wed, Nov 4, 2009 at 12:57 PM, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:
OK. In fact I'm now working on a test cluster, just to get the correct workflow.
On Wed, 4 Nov 2009 15:33:19 +1000 Peter Tiggerdine wrote:
> 7. You're going to need to copy this over manually, otherwise it
> will fail; I've fallen victim to this before. All cluster nodes need to start on
> the current revision of the file before you update it. I think this is a chicken
> and egg problem.
I have encountered this situation in the past, and in all cases the starting node detects that its version is not up to date and pulls the new config from the other node.
My scenario was:
node 1 and node 2 up
node 2 shut down
change node1's config (I mean in terms of services here; probably not valid if inserting a qdiskd section that wasn't there before, or possibly in other cases)
power on node2
node 2 gets the new config and applies it (depending on the availability and correctness of the definitions)
So I don't think that statement is correct...
Anyone care to comment on this?
Do you have the error messages from when you hit this problem?
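For what it's worth, when all nodes are up I push config changes with the standard RHEL 5 cluster tools, roughly like this (a sketch; double-check the exact invocations on your release):
# on the node where cluster.conf was edited (after bumping config_version)
ccs_tool update /etc/cluster/cluster.conf
# then, on any node, check which config version cman is actually running
cman_tool version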
On Wed, 4 Nov 2009 12:30:57 +0100 Jakov Sosic wrote:
> Well, I usually do rolling updates (I relocate the services to other
> nodes, update one node, then restart it and see if it joins the
> cluster).
But you are saying you did this for 5.3 -> 5.4 as well, whereas I hit the OOM problem that David documented too, with the corresponding Bugzilla entry...
So you joined a freshly updated 5.4 node to its previous cluster (made up entirely of 5.3 nodes) and didn't hit any problem at all?
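Just to make sure we mean the same thing by "rolling update", the per-node sequence I have in mind is roughly this (service and node names are placeholders):
clusvcadm -r MYSVC -m node2    # relocate services away from the node being updated
service rgmanager stop
service cman stop              # leave the cluster cleanly
yum update                     # 5.3 -> 5.4 packages
reboot
clustat                        # after reboot, verify the node rejoined and services are healthy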
Gianluca
OK. All went well in my virtual environment.
Furthermore, at step 7 I created a new IP service and updated the config on the first (already updated) node, enabling it while the second node, still on 5.3, was down.
Below is the diff against the previous (pre-5.4) config:
< <cluster alias="clumm" config_version="7" name="clumm">
---
> <cluster alias="clumm" config_version="5" name="clumm">
38,41d37
< <failoverdomain name="MM3" restricted="1" ordered="1" nofailback="1">
< <failoverdomainnode name="node1" priority="2"/>
< <failoverdomainnode name="node2" priority="1"/>
< </failoverdomain>
46d41
< <ip address="192.168.122.113" monitor_link="0"/>
62,64d56
< <service domain="MM3" autostart="1" name="MM3SRV">
< <ip ref="192.168.122.113"/>
< </service>
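In other words, the pieces added in config_version 7 are these (assembled from the diff above; indentation and exact placement inside the rm section are mine):
<failoverdomain name="MM3" restricted="1" ordered="1" nofailback="1">
    <failoverdomainnode name="node1" priority="2"/>
    <failoverdomainnode name="node2" priority="1"/>
</failoverdomain>
<ip address="192.168.122.113" monitor_link="0"/>
<service domain="MM3" autostart="1" name="MM3SRV">
    <ip ref="192.168.122.113"/>
</service>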
When the second node joins the cluster (step 11), it indeed gets the updated config and everything works.
I also successfully relocated the new service from the first node to the other one.
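For completeness, that kind of relocation is normally just an rgmanager call, something along these lines (the node name is illustrative):
clusvcadm -r MM3SRV -m node2   # relocate the MM3SRV service to node2
clustat                        # check where it ended up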
No OOM with this approach as written up by David.
Thanks.
Two other things:
1) I see these quorum-related messages on the first node, which did not appear during the previous days in the 5.3 environment:
Nov 5 08:00:14 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 08:27:08 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 08:27:08 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:48:23 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 09:48:23 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Did any timings change between releases?
The relevant timing-related lines in my cluster.conf were like this in 5.3 and remained the same in 5.4:
<cluster alias="clumm" config_version="7" name="clumm">
<totem token="162000"/>
<cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
<quorumd device="/dev/sda" interval="5" label="clummquorum" log_facility="local4" log_level="7" tko="16" votes="1">
<heuristic interval="2" program="ping -c1 -w1 192.168.122.1" score="1" tko="3000"/>
</quorumd>
(The heuristic tko is very large because I was testing the best and safest way to make on-the-fly changes to the heuristic, due to network maintenance activity that makes the gateway disappear for a while, at times the network guys cannot predict...)
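If my reading of the qdiskd parameters is correct (my assumption, so take it with a grain of salt), the numbers in those warnings line up with this config:
qdisk eviction window = interval x tko = 5 s x 16 = 80 s (matches the "hung for 40 seconds ... in 40 more seconds we will be evicted" pair)
quorum_dev_poll = 80000 ms = 80 s (same window)
totem token = 162000 ms, roughly twice the qdisk window, which I believe is the recommended ratio.
So the timings themselves look internally consistent.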
I don't know whether these messages come from latency problems in my virtual environment or not...
On the host side I don't see any related message in the dmesg output or in /var/log/messages...
2) I saw that a new kernel was just released... ;-(
Any hints about possible interference with the cluster infrastructure?
Gianluca
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster