Hello All again,

In continuation of my previous e-mail, the following is the point where I located the problem.

On both nodes I have the default RHEL 5.1 /etc/init.d/clvmd script. On node tweety-1, after cman starts and then rgmanager starts, it succeeds in starting the services:
Mar 24 16:46:57 tweety1 clurgmgrd: [10760]: <err> script:CLVMD: stop of /etc/init.d/clvmd failed (returned 143)
Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <notice> stop on script "CLVMD" returned 1 (generic error)
Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <info> Services Initialized
Mar 24 16:46:57 tweety1 clurgmgrd[10760]: <info> State change: Local UP
Mar 24 16:47:02 tweety1 clurgmgrd[10760]: <notice> Starting stopped service service:GFS2-t1
Mar 24 16:47:02 tweety1 clurgmgrd[10760]: <notice> Starting stopped service service:BOINC-t1
[...]
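(Side note: 143 is just the exit status the shell reports for a command killed by SIGTERM, i.e. 128 + 15. A quick way to see it from any bash prompt:)

    # a process killed with SIGTERM is reported as 128 + 15 = 143
    sleep 60 &
    kill -TERM $!
    wait $!
    echo $?    # prints 143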
So what I did with tweety-2 (the one that hangs) was to remove both cman and rgmanager from autostart and, after tweety-2 had booted up completely, start the services manually ("service cman start" and "service rgmanager start").
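(To be precise, what I ran was roughly the following; chkconfig is just the usual RHEL 5 way to take the init scripts out of the boot sequence:)

    # on tweety-2: keep cman and rgmanager from starting at boot
    chkconfig cman off
    chkconfig rgmanager off
    # reboot, then bring them up by hand in the normal order
    service cman start
    service rgmanager start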
This let me compare against the tweety-1 logs for the same scripts / services. What I found, for some reason I do not understand, is that on tweety-2 cman starts correctly but rgmanager hangs FOREVER at the point where tweety-1 moves on:
Mar 24 20:02:16 localhost clurgmgrd[5917]: <info> I am node #2
Mar 24 20:02:16 localhost clurgmgrd[5917]: <notice> Resource Group Manager Starting
Mar 24 20:02:16 localhost clurgmgrd[5917]: <info> Loading Service Data
Mar 24 20:02:17 localhost clurgmgrd[5917]: <info> Initializing Services
Mar 24 20:02:17 localhost clurgmgrd: [5917]: <err> script:CLVMD: stop of /etc/init.d/clvmd failed (returned 143)
Mar 24 20:02:17 localhost clurgmgrd[5917]: <notice> stop on script "CLVMD" returned 1 (generic error)

Also, reboot does not work on tweety-2: at the point where rgmanager should shut down, it again hangs FOREVER:

Mar 24 20:06:06 localhost rgmanager: [8219]: <notice> Shutting down Cluster Service Manager...

This is the last entry in /var/log/messages after "reboot". Only poweroff works.

Any ideas, someone???

Thank you all for your time,
Theophanis Kontogiannis
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Theophanis Kontogiannis

Hello All again,

I have a two-node cluster with the following config:

<?xml version="1.0"?>
<cluster alias="tweety" config_version="132" name="tweety">
    <fence_daemon clean_start="0" post_fail_delay="1" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="tweety-1" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="human-fence" nodename="tweety-1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="tweety-2" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="human-fence" nodename="tweety-2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_manual" name="human-fence"/>
    </fencedevices>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="tweety1" ordered="0" restricted="1">
                <failoverdomainnode name="tweety-1" priority="1"/>
            </failoverdomain>
            <failoverdomain name="tweety2" ordered="0" restricted="1">
                <failoverdomainnode name="tweety-2" priority="1"/>
            </failoverdomain>
            <failoverdomain name="tweety-cluster" ordered="1" restricted="1">
                <failoverdomainnode name="tweety-2" priority="1"/>
                <failoverdomainnode name="tweety-1" priority="1"/>
            </failoverdomain>
            <failoverdomain name="tweety-1-2" ordered="1" restricted="1">
                <failoverdomainnode name="tweety-1" priority="1"/>
                <failoverdomainnode name="tweety-2" priority="2"/>
            </failoverdomain>
            <failoverdomain name="tweety-2-1" ordered="1" restricted="1">
                <failoverdomainnode name="tweety-1" priority="2"/>
                <failoverdomainnode name="tweety-2" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <script file="/etc/init.d/clvmd" name="clvmd"/>
            <script file="/etc/init.d/gfs2" name="GFS2"/>
            <script file="/etc/init.d/boinc" name="BOINC"/>
            <script file="/etc/init.d/gfs2-check" name="GFS2-Control"/>
        </resources>
        <service autostart="1" domain="tweety1" name="LV-tweety1">
            <script ref="clvmd">
                <script ref="GFS2"/>
            </script>
        </service>
        <service autostart="1" domain="tweety2" name="LV-tweety2">
            <script ref="clvmd">
                <script ref="GFS2"/>
            </script>
        </service>
        <service autostart="1" domain="tweety1" name="BOINC-t1">
            <script ref="BOINC"/>
        </service>
        <service autostart="1" domain="tweety2" exclusive="0" name="BOINC-t2" recovery="restart">
            <script ref="BOINC"/>
        </service>
    </rm>
</cluster>

Tweety-1 boots up smoothly and brings up all the services. Tweety-2 boots up smoothly but brings up no services unless I manually do "service clvmd start" and "service gfs2 start".
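(In other words, after every boot of tweety-2 I have to run by hand what rgmanager should be doing for me, along these lines:)

    # run manually on tweety-2 after boot
    service clvmd start    # cluster LVM daemon
    service gfs2 start     # stock init script mounts the gfs2 entries in /etc/fstab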
The log on tweety-2 is:
Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais distributed locking service B.01.01'
Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais message service B.01.01'
Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais configuration service'
Mar 24 04:30:18 localhost ccsd[2672]: Cluster is not quorate. Refusing connection.
Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais cluster closed process group service v1.01'
Mar 24 04:30:18 localhost ccsd[2672]: Error while processing connect: Connection refused
Mar 24 04:30:18 localhost openais[2681]: [SERV ] Initialising service handler 'openais CMAN membership service 2.01'
Mar 24 04:30:18 localhost openais[2681]: [CMAN ] CMAN 2.0.73 (built Nov 29 2007 18:40:32) started
Mar 24 04:30:18 localhost openais[2681]: [SYNC ] Not using a virtual synchrony filter.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Creating commit token because I am the rep.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Saving state aru 0 high seq received 0
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Storing new sequence id for ring 41c
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering COMMIT state.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering RECOVERY state.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [0] member 10.254.254.254:
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1048 rep 10.254.254.254
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru 0 high delivered 0 received flag 1
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Did not need to originate any messages in recovery.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Sending initial ORF token
Mar 24 04:30:18 localhost openais[2681]: [CLM ] CLM CONFIGURATION CHANGE
Mar 24 04:30:18 localhost openais[2681]: [CLM ] New Configuration:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Left:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Joined:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] CLM CONFIGURATION CHANGE
Mar 24 04:30:18 localhost openais[2681]: [CLM ] New Configuration:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.254)
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Left:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Joined:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.254)
Mar 24 04:30:18 localhost openais[2681]: [SYNC ] This node is within the primary component and will provide service.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering OPERATIONAL state.
Mar 24 04:30:18 localhost openais[2681]: [CMAN ] quorum regained, resuming activity
Mar 24 04:30:18 localhost openais[2681]: [CLM ] got nodejoin message 10.254.254.254
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering GATHER state from 11.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Saving state aru 9 high seq received 9
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Storing new sequence id for ring 420
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering COMMIT state.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering RECOVERY state.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [0] member 10.254.254.253:
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1052 rep 10.254.254.253
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru c high delivered c received flag 1
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] position [1] member 10.254.254.254:
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] previous ring seq 1052 rep 10.254.254.254
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] aru 9 high delivered 9 received flag 1
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] Did not need to originate any messages in recovery.
Mar 24 04:30:18 localhost openais[2681]: [CLM ] CLM CONFIGURATION CHANGE
Mar 24 04:30:18 localhost openais[2681]: [CLM ] New Configuration:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.254)
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Left:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Joined:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] CLM CONFIGURATION CHANGE
Mar 24 04:30:18 localhost openais[2681]: [CLM ] New Configuration:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.253)
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.254)
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Left:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] Members Joined:
Mar 24 04:30:18 localhost openais[2681]: [CLM ] r(0) ip(10.254.254.253)
Mar 24 04:30:18 localhost openais[2681]: [SYNC ] This node is within the primary component and will provide service.
Mar 24 04:30:18 localhost openais[2681]: [TOTEM] entering OPERATIONAL state.
Mar 24 04:30:18 localhost openais[2681]: [
Mar 24 04:30:18 localhost openais[2681]: [MAIN ] Invalid packet data
Mar 24 04:30:18 localhost openais[2681]: [CLM ] got nodejoin message 10.254.254.253
Mar 24 04:30:18 localhost openais[2681]: [CLM ] got nodejoin message 10.254.254.254
Mar 24 04:30:18 localhost openais[2681]: [CPG ] got joinlist message from node 2
Mar 24 04:30:18 localhost openais[2681]: [CPG ] got joinlist message from node 1
Mar 24 04:30:18 localhost ccsd[2672]: Initial status:: Quorate
Mar 24 04:30:44 localhost modclusterd: startup succeeded
Mar 24 04:30:45 localhost kernel: dlm: Using TCP for communications
Mar 24 04:30:45 localhost kernel: dlm: connecting to 1
Mar 24 04:30:45 localhost kernel: dlm: got connection from 1
Mar 24 04:30:46 localhost clurgmgrd[3200]: <notice> Resource Group Manager Starting
Mar 24 04:30:46 localhost clurgmgrd[3200]: <info> Loading Service Data
Mar 24 04:30:55 localhost clurgmgrd[3200]: <info> Initializing Services
Mar 24 04:30:58 localhost clurgmgrd: [3200]: <err> script:clvmd: stop of /etc/init.d/clvmd failed (returned 143)
Mar 24 04:30:58 localhost clurgmgrd[3200]: <notice> stop on script "clvmd" returned 1 (generic error)
AND THAT'S IT.

However, on tweety-1 the log goes further than where tweety-2 stops:
Mar 24 04:23:39 tweety1 clurgmgrd[3379]: <info> Services Initialized
Mar 24 04:23:39 tweety1 clurgmgrd[3379]: <info> State change: Local UP
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Starting stopped service service:LV-tweety1
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Starting stopped service service:BOINC-t1
Mar 24 04:23:45 tweety1 clurgmgrd: [3379]: <err> script:BOINC: start of /etc/init.d/boinc failed (returned 1)
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> start on script "BOINC" returned 1 (generic error)
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <warning> #68: Failed to start service:BOINC-t1; return value: 1
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Stopping service service:BOINC-t1
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Service service:BOINC-t1 is recovering
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <warning> #71: Relocating failed service service:BOINC-t1
Mar 24 04:23:45 tweety1 clurgmgrd[3379]: <notice> Stopping service service:BOINC-t1
Mar 24 04:23:46 tweety1 clurgmgrd[3379]: <notice> Service service:BOINC-t1 is stopped
Mar 24 04:23:46 tweety1 clvmd: Cluster LVM daemon started - connected to CMAN
Mar 24 04:23:48 tweety1 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "tweety:gfs0"
Mar 24 04:23:48 tweety1 kernel: GFS2: fsid=tweety:gfs0.0: Joined cluster. Now mounting FS...
Mar 24 04:23:49 tweety1 clurgmgrd[3379]: <notice> Service service:LV-tweety1 started
Mar 24 04:24:42 tweety1 kernel: dlm: closing connection to node 2
Mar 24 04:25:21 tweety1 kernel: dlm: closing connection to node 2
Mar 24 04:27:32 tweety1 kernel: dlm: closing connection to node 2
Can someone give some food for thought as to what the problem might be? Do I need to provide more information?

Thank you all for your time,
Theophanis Kontogiannis