Thanks for getting back.
I'll try the debug shutdown with that command.
Though I think it's far from clear what "failed to stop cleanly" actually means here. The node the services were running on had gone (it was fenced), so there was nothing to stop before starting them on this node.
Thanks
Colin
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] on behalf of emmanuel segura [emi2fast@xxxxxxxxx]
Sent: 01 September 2012 11:04
To: linux clustering
Subject: Re: Services getting stuck on node
Hello Colin
Maybe your service doesn't switch over because this happened:
======================================================
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
======================================================
To debug your service stop, you can use: rg_test test /etc/cluster/cluster.conf stop service <NAME_OF_SERVICE>
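For example, with the two services named in your log (just a sketch; adjust the names to whatever is defined in your cluster.conf):
rg_test test /etc/cluster/cluster.conf stop service nfsdprj
rg_test test /etc/cluster/cluster.conf stop service httpd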
To help you further, I think it would be easier if you showed us your cluster.conf.
Thanks :-)
--
this is my life and I live it for as long as God wills
2012/9/1 Colin Simpson <Colin.Simpson@xxxxxxxxxx>
Hi
I had a strange issue this afternoon. One of my cluster nodes died (possibly a hardware fault or a driver issue), but the other node failed to take over a number of its services (it's a two-node cluster), even though the dead node was successfully fenced.
clustat indicated that the services were still on the original node (started), but the top lines correctly showed that node as "offline". The rgmanager log for this event says:
Aug 31 17:19:30 rgmanager [ip] Link detected on bond0
Aug 31 17:19:30 rgmanager [ip] Local ping to 10.10.1.45 succeeded
Aug 31 17:19:37 rgmanager State change: bld1uxn1i DOWN
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.46, Level 10
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.45, Level 0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.33, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.46 present on bond0
Aug 31 17:19:49 rgmanager [ip] Checking 10.10.1.43, Level 0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.45 present on bond0
Aug 31 17:19:49 rgmanager [ip] 10.10.1.33 present on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] 10.10.1.43 present on bond0
Aug 31 17:19:49 rgmanager Taking over service service:nfsdprj from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager Taking over service service:httpd from down member bld1uxn1i
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link for bond0: Detected
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #47: Failed changing service status
Aug 31 17:19:49 rgmanager [ip] Local ping to 10.10.1.46 succeeded
Aug 31 17:19:49 rgmanager [ip] Link detected on bond0
Aug 31 17:19:49 rgmanager #13: Service service:nfsdprj failed to stop cleanly
Aug 31 17:19:49 rgmanager #13: Service service:httpd failed to stop cleanly
A couple of other services did successfully switch after this.
I have seen this a few times (randomly) on various clusters since around the time of upgrading from 6.2 to 6.3 (services refusing to stop cleanly on a node). It's hard to reproduce, and when a node is down we usually just want the services restarted as fast as possible, which limits the time available for debugging.
How can I see what is causing the "#47: Failed changing service status", and is there any more debugging we can turn on in rgmanager to help with this?
Or, better still, has anyone else seen anything like this?
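One thing I was considering trying, if I have the cluster.conf(5) syntax right (so treat this as a sketch rather than a known-good config), is enabling debug logging for rgmanager via the logging section of cluster.conf:
<logging>
    <logging_daemon name="rgmanager" debug="on"/>
</logging>
I'd appreciate confirmation that this is the right knob for rgmanager on 6.3.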
Thanks
Colin
________________________________
This email and any files transmitted with it are confidential and are intended solely for the use of the individual or entity to whom they are addressed. If you are not the original recipient or the person responsible for delivering the email to the intended recipient, be advised that you have received this email in error, and that any use, dissemination, forwarding, printing, or copying of this email is strictly prohibited. If you received this email in error, please immediately notify the sender and delete the original.
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster