Re: Suggestion for backbone network maintenance

On Wed, Oct 7, 2009 at 5:03 PM, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:
Hello,
cluster rh el 5.3 with 2 nodes and a quorum disk with heuristics. The nodes are in different sites.
At this moment inside cluster.conf I have this:

        <quorumd device="/dev/mapper/mpath6" interval="5" label="oraquorum" log_facility="local4" log_level="7" tko="16" votes="1">
                <heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="20"/>
        </quorumd>
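
For reference, with these values the timeouts work out roughly like this (assuming the usual qdiskd behaviour where the effective timeout is interval x tko):

    quorumd:   5 s x 16 = 80 s before qdiskd declares a node dead
    heuristic: 2 s x 20 = 40 s of failed pings to 10.4.5.250 before the heuristic (and its score) is lost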

[snip]



It seems it doesn't work as I expected...
You have to restart the qdisk daemon manually to make it pick up the changes.
I would expect the cluster manager to communicate with it when you do a ccs_tool update...
qdiskd doesn't seem to have any kind of reload function (based on the init script options, at least).
Also, in my situation it is better to keep both nodes up and running:
when you restart qdiskd it actually takes about 2 minutes and 10 seconds to re-register and count again as one vote out of three,
and a few seconds before that I get the emergency message that quorum was lost, so my services (FS and IP) are suddenly stopped and then restarted when quorum is regained...
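
One way to watch for qdiskd re-registering (so you know when its vote is being counted again) is something like this on either node; clustat and cman_tool are the standard tools here, the grep patterns are only an example:

    watch -n 5 'clustat | grep -i "quorum disk"; cman_tool status | grep -E "Total votes|Quorum"'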

So the successful steps are, at least in my case:

nodes 1 and 2 both up and running cluster services

On node1:
- edit cluster.conf, incrementing the version number and setting tko=1500 (see the sketch just after this list of steps)
- ccs_tool update /etc/cluster/cluster.conf
- cman_tool version -r <new_version>   (is this still necessary?)
- service qdiskd restart; sleep 2; service qdiskd start
(sometimes, due to a bug in qdiskd, it doesn't start right away even if you do stop/start, so to be safe I put an extra start command just after the first attempt...
more precisely: bug https://bugzilla.redhat.com/show_bug.cgi?id=485199
I'm on cman 2.0.98-1.el5_3.1 to simulate my prod cluster, and this bug seems to have first been fixed in RH EL 5.4 with cman-2.0.115-1.el5, then superseded a few days later by the important fix 2.0.115-1.el5_4.2)
Anyway, after about 2 minutes and 10 seconds qdiskd finishes its initialization and synchronises with the instance on the other node...
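
Just to make the config change concrete, this is a sketch of what the quorumd stanza looks like during the maintenance window, assuming the tko=1500 goes on the heuristic (the ping is what will fail while the backbone is down); only that attribute changes with respect to the original above:

        <quorumd device="/dev/mapper/mpath6" interval="5" label="oraquorum" log_facility="local4" log_level="7" tko="16" votes="1">
                <heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="1500"/>
        </quorumd>

With interval=2 that heuristic only gives up after roughly 2 x 1500 = 3000 seconds (about 50 minutes) of failed pings; whether that is enough obviously depends on how long the maintenance takes.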
 
Now I can go to node2 and run on it
- service qdiskd restart; sleep 2; service qdiskd start

This way both nodes are aligned with the qdiskd changes.

In my case I can then shut down node2 and wait for the network people to tell me that the maintenance is finished, before re-applying the initial configuration...
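
For completeness, reverting afterwards is just the same sequence in reverse; a rough sketch (with the same caveat as above about the qdiskd restart sometimes needing a second start):

On node1 (and then the qdiskd restart again on node2, once node1's qdiskd has re-registered):
- edit /etc/cluster/cluster.conf, setting the heuristic tko back to its original value (20 in my config above) and incrementing the version number again
- ccs_tool update /etc/cluster/cluster.conf
- cman_tool version -r <new_version>
- service qdiskd restart; sleep 2; service qdiskd start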

Comments are again welcome obviously ;-)

Gianluca
