Follow Up: Problem with 2 node cluster - node 2 not starting services


Hello All again,

 

In continuation of my previous e-mail, the following is the point where I located the problem.

 

On both nodes I have the default RHEL5.1 /etc/init.d/clvmd script.

 

On node tweety-1, after cman and then rgmanager are started, it succeeds in starting the services:

 

Mar 24 16:46:57 tweety1 clurgmgrd: [10760]: <err> script:CLVMD: stop of /etc/init.d/clvmd failed (returned 143)

Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <notice> stop on script "CLVMD" returned 1 (generic error)

Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <info> Services Initialized

Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <info> State change: Local UP

Mar 24 16:47:02 tweety1 clurgmgrd[10760]: <notice> Starting stopped service service:GFS2-t1

Mar 24 16:47:02 tweety1 clurgmgrd[10760]: <notice> Starting stopped service service:BOINC-t1

            …………………..

 

So what I did with tweety-2 (the one that hangs) was to remove both cman and rgmanager from auto start and, after tweety-2 had completely booted, start them manually ("service cman start" and "service rgmanager start"). This let me compare the logs against tweety-1 for the same scripts / services.
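For reference, these are exactly the commands I mean (standard RHEL5 chkconfig / service syntax):

    # keep the cluster daemons out of the boot sequence on tweety-2
    chkconfig cman off
    chkconfig rgmanager off

    # ...then, after a complete reboot, start them by hand and watch the log
    service cman start
    service rgmanager start
    tail -f /var/log/messages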

 

So I found out that, for some reason I do not understand, on tweety-2 cman starts correctly but rgmanager hangs forever at the point where tweety-1 moves on:

 

Mar 24 20:02:16 localhost clurgmgrd[5917]: <info> I am node #2

Mar 24 20:02:16 localhost clurgmgrd[5917]: <notice> Resource Group Manager Starting

Mar 24 20:02:16 localhost clurgmgrd[5917]: <info> Loading Service Data

Mar 24 20:02:17 localhost clurgmgrd[5917]: <info> Initializing Services

Mar 24 20:02:17 localhost clurgmgrd: [5917]: <err> script:CLVMD: stop of /etc/init.d/clvmd failed (returned 143)

Mar 24 20:02:17 localhost clurgmgrd[5917]: <notice> stop on script "CLVMD" returned 1 (generic error)
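As a side note (my own interpretation, not something the logs state explicitly): a return code of 143 is 128 + 15, i.e. the clvmd init script was killed by SIGTERM rather than exiting on its own, which is why rgmanager reports the stop as a generic error. A quick shell illustration of where 143 comes from:

    sleep 60 &            # stand-in for a long-running init script
    kill -TERM $!         # signal 15
    wait $!
    echo $?               # prints 143 = 128 + 15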

 

Also, reboot does not work on tweety-2, because at the point where rgmanager should shut down it again hangs forever:

 

Mar 24 20:06:06 localhost rgmanager: [8219]: <notice> Shutting down Cluster Service Manager...

 

This is the last entry in /var/log/messages after "reboot". Only poweroff works.

 

Does anyone have any ideas?
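In case it helps, I can also post status output from tweety-2 while clurgmgrd is stuck. I am assuming the standard RHEL5 cluster tools are the right things to look at:

    cman_tool status       # quorum and membership as cman sees them
    cman_tool services     # fence domain, dlm lockspaces and gfs mount groups
    clustat                # what rgmanager thinks each service is doing
    ps -eo pid,stat,wchan:32,cmd | grep clurgmgrd    # where the daemon is waiting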

 

Thank you all for your time,

Theophanis Kontogiannis

 

 


From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Theophanis Kontogiannis
Sent: Monday, March 24, 2008 4:43 AM
To: linux-cluster@xxxxxxxxxx
Subject: Problem with 2 node cluster - node 2 not starting services

 

Hello All again,

I have a two-node cluster with the following config:

<?xml version="1.0"?>

<cluster alias="tweety" config_version="132" name="tweety">

        <fence_daemon clean_start="0" post_fail_delay="1" post_join_delay="3"/>

        <clusternodes>

                <clusternode name="tweety-1" nodeid="1" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="human-fence" nodename="tweety-1"/>

                                </method>

                        </fence>

                </clusternode>

                <clusternode name="tweety-2" nodeid="2" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="human-fence" nodename="tweety-2"/>

                                </method>

                        </fence>

                </clusternode>

        </clusternodes>

        <cman expected_votes="1" two_node="1"/>

        <fencedevices>

                <fencedevice agent="fence_manual" name="human-fence"/>

        </fencedevices>

        <rm log_level="7">

                <failoverdomains>

                        <failoverdomain name="tweety1" ordered="0" restricted="1">

                                <failoverdomainnode name="tweety-1" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="tweety2" ordered="0" restricted="1">

                                <failoverdomainnode name="tweety-2" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="tweety-cluster" ordered="1" restricted="1">

                                <failoverdomainnode name="tweety-2" priority="1"/>

                                <failoverdomainnode name="tweety-1" priority="1"/>

                        </failoverdomain>

                        <failoverdomain name="tweety-1-2" ordered="1" restricted="1">

                                <failoverdomainnode name="tweety-1" priority="1"/>

                                <failoverdomainnode name="tweety-2" priority="2"/>

                        </failoverdomain>

                        <failoverdomain name="tweety-2-1" ordered="1" restricted="1">

                                <failoverdomainnode name="tweety-1" priority="2"/>

                                <failoverdomainnode name="tweety-2" priority="1"/>

                        </failoverdomain>

                </failoverdomains>

                <resources>

                        <script file="/etc/init.d/clvmd" name="clvmd"/>

                        <script file="/etc/init.d/gfs2" name="GFS2"/>

                        <script file="/etc/init.d/boinc" name="BOINC"/>

                        <script file="/etc/init.d/gfs2-check" name="GFS2-Control"/>

                </resources>

                <service autostart="1" domain="tweety1" name="LV-tweety1">

                        <script ref="clvmd">

                                <script ref="GFS2"/>

                        </script>

                </service>

                <service autostart="1" domain="tweety2" name="LV-tweety2">

                        <script ref="clvmd">

                                <script ref="GFS2"/>

                        </script>

                </service>

                <service autostart="1" domain="tweety1" name="BOINC-t1">

                        <script ref="BOINC"/>

                </service>

                <service autostart="1" domain="tweety2" exclusive="0" name="BOINC-t2" recovery="restart">

                        <script ref="BOINC"/>

                </service>

        </rm>

</cluster>
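For reference, my reading of the nested <script> resources above (this is how I understand rgmanager resource trees, not something taken from the logs) is that for service LV-tweety1 the parent clvmd script starts before the child GFS2 script, and the stop order is reversed. Roughly:

    # sketch of what rgmanager effectively runs for service LV-tweety1
    # (in reality the init scripts are invoked through rgmanager's script agent)
    /etc/init.d/clvmd start      # parent resource first
    /etc/init.d/gfs2 start       # then the child

    # ...and on stop, the child first:
    /etc/init.d/gfs2 stop
    /etc/init.d/clvmd stop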

 

Tweety-1 boots up smoothly and brings up all the services.

Tweety-2 boots up smoothly but brings up no services unless I manually run "service clvmd start" and "service gfs2 start".
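For what it is worth, the rgmanager-level equivalent of bringing those services up (assuming clusvcadm is the right tool for this) would be something like:

    clusvcadm -e LV-tweety2       # enable the clvmd + GFS2 service on tweety-2
    clusvcadm -e BOINC-t2         # enable the BOINC service
    # (a "service:" prefix, e.g. service:LV-tweety2, may be needed depending on version)

but since rgmanager never gets past initialization on tweety-2, I have been starting the init scripts directly instead.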

The log on tweety-2 is:

Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais distributed locking service B.01.01'

Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais message service B.01.01'

Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais configuration service'

Mar 24 04:30:18 localhost ccsd[2672]: Cluster is not quorate.  Refusing connection.

Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais cluster closed process group service v1.01'

Mar 24 04:30:18 localhost ccsd[2672]: Error while processing connect: Connection refused

Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais CMAN membership service 2.01'

Mar 24 04:30:18 localhost openais[2681]: [CMAN ] CMAN 2.0.73 (built Nov 29 2007 18:40:32) started

Mar 24 04:30:18 localhost openais[2681]: [SYNC ] Not using a virtual synchrony filter.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Creating commit token because I am the rep.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Saving state aru 0 high seq received 0

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Storing new sequence id for ring 41c

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering COMMIT state.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering RECOVERY state.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [0] member 10.254.254.254:

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1048 rep 10.254.254.254

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru 0 high delivered 0 received flag 1

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Did not need to originate any messages in recovery.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Sending initial ORF token

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] CLM CONFIGURATION CHANGE

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] New Configuration:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Left:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Joined:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] CLM CONFIGURATION CHANGE

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] New Configuration:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.254)

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Left:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Joined:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.254)

Mar 24 04:30:18 localhost openais[2681]: [SYNC ] This node is within the primary component and will provide service.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering OPERATIONAL state.

Mar 24 04:30:18 localhost openais[2681]: [CMAN ] quorum regained, resuming activity

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] got nodejoin message 10.254.254.254

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering GATHER state from 11.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Saving state aru 9 high seq received 9

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Storing new sequence id for ring 420

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering COMMIT state.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering RECOVERY state.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [0] member 10.254.254.253:

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1052 rep 10.254.254.253

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru c high delivered c received flag 1

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [1] member 10.254.254.254:

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1052 rep 10.254.254.254

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru 9 high delivered 9 received flag 1

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Did not need to originate any messages in recovery.

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] CLM CONFIGURATION CHANGE

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] New Configuration:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.254)

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Left:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Joined:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] CLM CONFIGURATION CHANGE

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] New Configuration:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.253)

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.254)

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Left:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] Members Joined:

Mar 24 04:30:18 localhost openais[2681]: [CLM  ]        r(0) ip(10.254.254.253)

Mar 24 04:30:18 localhost openais[2681]: [SYNC ] This node is within the primary component and will provide service.

Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering OPERATIONAL state.

Mar 24 04:30:18 localhost openais[2681]: [MAIN ] Received message has invalid digest... ignoring.

Mar 24 04:30:18 localhost openais[2681]: [MAIN ] Invalid packet data

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] got nodejoin message 10.254.254.253

Mar 24 04:30:18 localhost openais[2681]: [CLM  ] got nodejoin message 10.254.254.254

Mar 24 04:30:18 localhost openais[2681]: [CPG  ] got joinlist message from node 2

Mar 24 04:30:18 localhost openais[2681]: [CPG  ] got joinlist message from node 1

Mar 24 04:30:18 localhost ccsd[2672]: Initial status:: Quorate

Mar 24 04:30:44 localhost modclusterd: startup succeeded

Mar 24 04:30:45 localhost kernel: dlm: Using TCP for communications

Mar 24 04:30:45 localhost kernel: dlm: connecting to 1

Mar 24 04:30:45 localhost kernel: dlm: got connection from 1

Mar 24 04:30:46 localhost clurgmgrd[3200]: <notice> Resource Group Manager Starting

Mar 24 04:30:46 localhost clurgmgrd[3200]: <info> Loading Service Data

Mar 24 04:30:55 localhost clurgmgrd[3200]: <info> Initializing Services

Mar 24 04:30:58 localhost clurgmgrd: [3200]: <err> script:clvmd: stop of /etc/init.d/clvmd failed (returned 143)

Mar 24 04:30:58 localhost clurgmgrd[3200]: <notice> stop on script "clvmd" returned 1 (generic error)

And that is where the log on tweety-2 ends.

However, on tweety-1 the log continues past the point where tweety-2 stops:

Mar 24 04:23:39 tweety1 clurgmgrd[3379]: <info> Services Initialized

Mar 24 04:23:39 tweety1 clurgmgrd[3379]: <info> State change: Local UP

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Starting stopped service service:LV-tweety1

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Starting stopped service service:BOINC-t1

Mar 24 04:23:45 tweety1 clurgmgrd: [3379]: <err> script:BOINC: start of /etc/init.d/boinc failed (returned 1)

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> start on script "BOINC" returned 1 (generic error)

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <warning> #68: Failed to start service:BOINC-t1; return value: 1

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Stopping service service:BOINC-t1

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Service service:BOINC-t1 is recovering

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <warning> #71: Relocating failed service service:BOINC-t1

Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Stopping service service:BOINC-t1

Mar 24 04:23:46 tweety1 clurgmgrd[3379]: <notice> Service service:BOINC-t1 is stopped

Mar 24 04:23:46 tweety1 clvmd: Cluster LVM daemon started - connected to CMAN

Mar 24 04:23:48 tweety1 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "tweety:gfs0"

Mar 24 04:23:48 tweety1 kernel: GFS2: fsid=tweety:gfs0.0: Joined cluster. Now mounting FS...

Mar 24 04:23:49 tweety1 clurgmgrd[3379]: <notice> Service service:LV-tweety1 started

Mar 24 04:24:42 tweety1 kernel: dlm: closing connection to node 2

Mar 24 04:25:21 tweety1 kernel: dlm: closing connection to node 2

Mar 24 04:27:32 tweety1 kernel: dlm: closing connection to node 2

Can someone give me some food for thought as to what the problem might be? Do I need to provide more information?

Thank you all for your time,

Theophanis Kontogiannis

 

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
