On Thu, Nov 5, 2009 at 10:38 AM, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:
Probably 1) is due to this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=500450
that found its solution released in RHSA-2009-1341 advisory
with cman-2.0.115-1.el5.x86_64.rpm.
[snip]
Two other things:
1) I see these messages about quorum on the first node, which didn't appear during the previous days in the 5.3 env:
Nov 5 08:00:14 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 08:27:08 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 08:27:08 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 09:48:23 mork qdiskd[2206]: <warning> qdiskd: read (system call) has hung for 40 seconds
Nov 5 09:48:23 mork qdiskd[2206]: <warning> In 40 more seconds, we will be evicted
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Nov 5 10:00:15 mork clurgmgrd: [2692]: <notice> Getting status
Have any timings changed between releases?
My relevant timing lines in cluster.conf were like this in 5.3 and remained so in 5.4:
<cluster alias="clumm" config_version="7" name="clumm">
<totem token="162000"/>
<cman quorum_dev_poll="80000" expected_votes="3" two_node="0"/>
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
<quorumd device="/dev/sda" interval="5" label="clummquorum" log_facility="local4" log_level="7" tko="16" votes="1">
<heuristic interval="2" program="ping -c1 -w1 192.168.122.1" score="1" tko="3000"/>
</quorumd>
(tko is very large in the heuristic because I was testing the best and safest way to make on-the-fly changes to the heuristic, due to network maintenance activity causing the gw to disappear for some time, not predictable by the net-guys...)
I don't know whether this message derives from latency problems in my virtual env or not...
On the host side I don't see any messages with the dmesg command or in /var/log/messages...
2) I saw that a new kernel has just been released... ;-(
Any hints about possible interference with the cluster infrastructure?
Gianluca
Probably 1) is due to this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=500450
that found its solution released in RHSA-2009-1341 advisory
And coming from 2.0.98 this is reasonable.
In my case tko=16 and interval=5, so the max time tolerance is about 80 seconds, which matches the 40+40 seconds I see in the messages...
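To make the arithmetic explicit, here is a rough sketch using the values from the cluster.conf above. The halving of the warning interval is just my reading of the log lines ("hung for 40 seconds" / "In 40 more seconds, we will be evicted"), not something I checked in the qdiskd source, and the quorum_dev_poll/token rules of thumb are how I understand the usual recommendations, not an official check:

```python
# Sanity-check qdiskd timing relationships (values from the cluster.conf above).
# These are plain local variables for illustration, not a real cluster API.
interval = 5              # qdiskd poll interval, seconds
tko = 16                  # missed polls tolerated before eviction
quorum_dev_poll = 80000   # cman quorum device poll timeout, milliseconds
token = 162000            # totem token timeout, milliseconds

# qdiskd declares a node dead after interval * tko seconds in total
eviction_s = interval * tko
print(eviction_s)         # 80 seconds total tolerance

# the log warning appears at the halfway point, hence "hung for 40 seconds,
# in 40 more seconds we will be evicted"
print(eviction_s // 2)    # 40 seconds

# rules of thumb (as I understand them): cman should wait at least as long
# as qdiskd's eviction window, and the totem token should be longer still
assert quorum_dev_poll >= eviction_s * 1000
assert token > quorum_dev_poll
```

So the 40+40 second messages are just the halfway warning of the 80-second window, not necessarily a sign that the timings changed between 5.3 and 5.4.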
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster