On Wed, Nov 4, 2009 at 12:57 PM, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:
OK. In fact I'm now working on a test cluster, just to get the correct workflow.
On Wed, 4 Nov 2009 15:33:19 +1000 Peter Tiggerdine wrote:
> 7. You're going to need to copy this over manually, otherwise it
> will fail; I've fallen victim to this before. All cluster nodes need to start on
> the current revision of the file before you update it. I think this is a chicken
> and egg problem.
I have encountered this situation in the past, and in all cases the starting node detects that its version is not up to date and pulls the new config from the other node.
My scenario was:
node 1 and node 2 up
node 2 shut down
change node1's config (I mean in terms of services here; probably not valid if inserting a qdiskd section that wasn't there before, or possibly in other cases)
power on node2
node 2 gets the new config and applies it (depending on the availability and correctness of the definitions)
So I don't think that statement is correct...
Anyone care to comment on this?
Do you have the error messages from when you hit this problem?
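For what it's worth, when all nodes are up I push config changes with the standard RHEL 5 cluster tools, roughly like this (a sketch; double-check the exact invocations on your release):
# on the node where cluster.conf was edited (after bumping config_version)
ccs_tool update /etc/cluster/cluster.conf
# then, on any node, check which config version cman is actually running
cman_tool version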
On Wed, 4 Nov 2009 12:30:57 +0100 Jakov Sosic wrote:
> Well, I usually do rolling updates (I relocate the services to other
> nodes, update one node, then restart it and see if it joins the
> cluster).
But you are saying you did this for 5.3 -> 5.4 as well, whereas I hit the OOM problem that David documented too, with the corresponding Bugzilla entry...
So you joined a freshly updated 5.4 node to its previous cluster (made up entirely of 5.3 nodes) and didn't hit any problem at all?
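Just to make sure we mean the same thing by "rolling update", the per-node sequence I have in mind is roughly this (service and node names are placeholders):
clusvcadm -r MYSVC -m node2    # relocate services away from the node being updated
service rgmanager stop
service cman stop              # leave the cluster cleanly
yum update                     # 5.3 -> 5.4 packages
reboot
clustat                        # after reboot, verify the node rejoined and services are healthy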
Gianluca
OK. All went well in my virtual environment.
Furthermore, at step 7 I created a new IP service and updated the config on the first (already updated) node, enabling it while the second node, still on 5.3, was down.
Below is the diff against the previous (pre-5.4) config:
< <cluster alias="clumm" config_version="7" name="clumm">
---
> <cluster alias="clumm" config_version="5" name="clumm">
38,41d37
< <failoverdomain name="MM3" restricted="1" ordered="1" nofailback="1">
< <failoverdomainnode name="node1" priority="2"/>
< <failoverdomainnode name="node2" priority="1"/>
< </failoverdomain>
46d41
< <ip address="192.168.122.113" monitor_link="0"/>
62,64d56
< <service domain="MM3" autostart="1" name="MM3SRV">
< <ip ref="192.168.122.113"/>
< </service>
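In other words, the pieces added in config_version 7 are these (assembled from the diff above; indentation and exact placement inside the rm section are mine):
<failoverdomain name="MM3" restricted="1" ordered="1" nofailback="1">
    <failoverdomainnode name="node1" priority="2"/>
    <failoverdomainnode name="node2" priority="1"/>
</failoverdomain>
<ip address="192.168.122.113" monitor_link="0"/>
<service domain="MM3" autostart="1" name="MM3SRV">
    <ip ref="192.168.122.113"/>
</service>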
When the second node joins the cluster (step 11), it indeed gets the updated config and everything works.
I also successfully relocated the new service from the first node to the other one.
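For completeness, that kind of relocation is normally just an rgmanager call, something along these lines (the node name is illustrative):
clusvcadm -r MM3SRV -m node2   # relocate the MM3SRV service to node2
clustat                        # check where it ended up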
No OOM with this approach as written up by David.
Thanks.
Two other things:
1) I see these quorum-related messages on the first node, which did not appear during the previous days in the 5.3 environment:
Nov 5 08:00:14 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 08:27:08 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 08:27:08 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:48:23 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 09:48:23 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Did any timings change between releases?
The relevant timing-related lines in my cluster.conf were like this in 5.3 and remained the same in 5.4:
<cluster alias="clumm" config_version="7" name="clumm">
<totem token="162000"/>
<cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
<quorumd device="/dev/sda" interval="5" label="clummquorum" log_facility="local4" log_level="7" tko="16" votes="1">
<heuristic interval="2" program="ping -c1 -w1 192.168.122.1" score="1" tko="3000"/>
</quorumd>
(The heuristic tko is very large because I was testing the best and safest way to make on-the-fly changes to the heuristic, due to network maintenance activity that makes the gateway disappear for a while, at times the network guys cannot predict...)
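If my reading of the qdiskd parameters is correct (my assumption, so take it with a grain of salt), the numbers in those warnings line up with this config:
qdisk eviction window = interval x tko = 5 s x 16 = 80 s (matches the "hung for 40 seconds ... in 40 more seconds we will be evicted" pair)
quorum_dev_poll = 80000 ms = 80 s (same window)
totem token = 162000 ms, roughly twice the qdisk window, which I believe is the recommended ratio.
So the timings themselves look internally consistent.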
I don't know whether these messages come from latency problems in my virtual environment or not...
On the host side I don't see any related message in the dmesg output or in /var/log/messages...
2) I saw that a new kernel was just released... ;-(
Any hints about possible interference with the cluster infrastructure?
Gianluca
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster