what happens during orderly shutdown ?

"Martin Waite" <Martin.Waite@xxxxxxxxxxxx> · Mon, 16 Aug 2010 17:53:40 +0100

Hi,

We
have hit a snag during testing of a 3-node cluster, where only nodes 1 and 2
are allowed to run the service.

If
we shutdown the node running the service, the service first starts on the
surviving node, but is then stopped.  This is not an activity we would
normally expect to do, but some of the environment monitoring in our servers
can trigger this situation - temperature getting too high etc.

What
I think happens is the following:

1.   
host
1: shutdown calls on rgmanager to stop:  this takes several seconds,
stopping the service.

2.   
host
2: rgmanager starts to relocate the service to the surviving node

3.   
host
1: shutdown calls on cman to stop

4.   
host
2: while relocating the service, cman enters gather state because host 1 has
dropped out.  The service is still starting and triggers "Failed
changing service status" in log below.

5.   
host
1: halts

6.   
host
2: the service completes start up.  The cluster reconfiguration completes. 

7.   
host
2: the service is stopped.

Is
this a bug in the cluster system, or just an unfortunate sequence of events
that the cluster cannot cope with, i.e. node failure while a service is relocating
?

regards,

Martin

Here
is the log from node 2:

Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice>
Member 1 shutting down

Aug 16 16:58:40 gmhcluster2 openais[4084]: [TOTEM]
Retransmit List: 86

Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice>
Starting stopped service service:MASTERVIP

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] The token
was lost in the OPERATIONAL state.

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Receive
multicast socket recv buffer size (320000 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Transmit
multicast socket send buffer size (262142 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering
GATHER state from 2.

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err>
#47: Failed changing service status

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning>
#71: Relocating failed service service:MASTERVIP

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning>
#70: Failed to relocate service:MASTERVIP; restarting locally

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <notice>
Starting stopped service service:MASTERVIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering
GATHER state from 0.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Creating
commit token because I am the rep.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Saving
state aru 90 high seq received 90

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Storing
new sequence id for ring 1e4

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering
COMMIT state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering
RECOVERY state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position
[0] member 172.30.31.4:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous
ring seq 480 rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90
high delivered 90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position
[1] member 172.30.31.5:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous
ring seq 480 rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90
high delivered 90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Did not
need to originate any messages in recovery.

Aug 16 16:59:15 gmhcluster2 kernel: dlm: closing connection
to node 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Sending
initial ORF token

Aug 16 16:59:15 gmhcluster2 clurgmgrd: [4132]: <info>
Executing /usr/sbin/dc-cluster-resourceagent start

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM
CONFIGURATION CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New
Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM 
]      r(0) ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM 
]      r(0) ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]
Members Left:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM 
]      r(0) ip(172.30.31.3)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]
Members Joined:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM
CONFIGURATION CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New
Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM 
]      r(0) ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent:
[2492] INFO - ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM 
]      r(0) ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2
/usr/sbin/dc-cluster-resourceagent: [2492] INFO - ACTIVESITE service is running
- ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]
Members Left:

Aug 16 16:59:15 gmhcluster2
/usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: Performing
precautionary updates to databases before setting up VIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]
Members Joined:

Aug 16 16:59:15 gmhcluster2
/usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running
precautionary update [touch /tmp/dc-set-dr-counters] 

Aug 16 16:59:15 gmhcluster2 openais[4084]: [SYNC ] This node
is within the primary component and will provide service.

Aug 16 16:59:15 gmhcluster2
/usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running
precautionary update [touch /tmp/dc-set-dr-auto-increments]

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering
OPERATIONAL state.

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got
nodejoin message 172.30.31.4

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got
nodejoin message 172.30.31.5

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got
joinlist message from node 3

Aug 16 16:59:16 gmhcluster2 clurgmgrd: [4132]: <info>
Adding IPv4 address 172.30.30.93/24 to bond0

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got
joinlist message from node 2

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice>
Service service:MASTERVIP started

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice>
Stopping service service:MASTERVIP

Aug 16 16:59:17 gmhcluster2 clurgmgrd: [4132]: <info>
Removing IPv4 address 172.30.30.93/24 from bond0

Aug 16 16:59:27 gmhcluster2 clurgmgrd: [4132]: <info>
Executing /usr/sbin/dc-cluster-resourceagent stop

Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice>
Service service:MASTERVIP is stopped

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster