On 11/12/07, Lon Hohberger <lhh@xxxxxxxxxx> wrote:
I observed a similar problem on the test cluster. It appears that clurgmgrd deadlocks in some cases in groups.c:count_resource_groups(). It does not happen every time, but it is reproducible: the surviving node calls rg_lock(service:mysql) @ groups.c:101 and gets stuck, and the other node's resource manager then waits indefinitely for the same lock:
{3965} rg_lock(service:mysql) @ groups.c:101
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"
[3460] debug: Sending node states to CTX0xa2a7fd0
[3460] debug: Sending service states to CTX0xa2a7fd0
no key for rg="service:mysql"
no key for rg="service:test"
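For what it's worth, the hang itself looks like the classic "lock taken and never released" pattern rather than anything exotic. Here is a tiny pthread sketch of my own (NOT rgmanager code - the rg_lock name and "service:mysql" are only borrowed from the log for illustration): one context takes the per-service lock and never drops it, so every later caller, count_resource_groups() included, blocks forever.

/* Minimal pthread analogy of the hang in the log above. Names are
 * borrowed from the log; this is not rgmanager's actual locking code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t service_mysql_lock = PTHREAD_MUTEX_INITIALIZER;

static void rg_lock_demo(const char *rg)        /* stand-in for rg_lock() */
{
        printf("rg_lock(%s)\n", rg);
        pthread_mutex_lock(&service_mysql_lock);
}

static void *holder(void *arg)
{
        (void)arg;
        rg_lock_demo("service:mysql");
        pause();        /* never unlocks: stands in for the node that died
                         * or got fenced while still owning the lock */
        return NULL;
}

static void *waiter(void *arg)
{
        (void)arg;
        sleep(1);
        rg_lock_demo("service:mysql");  /* this is where the surviving node's
                                         * clurgmgrd sits: groups.c:101 */
        printf("never reached\n");
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, holder, NULL);
        pthread_create(&b, NULL, waiter, NULL);
        pthread_join(b, NULL);          /* hangs, just like clurgmgrd */
        return 0;
}

In the real cluster the lock is of course a cluster-wide lock rather than a pthread mutex, which is why rebooting the node that holds it is what finally releases it (see below).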
To the original poster: the surviving node's clurgmgrd is "unkillable" as well.
You can try rebooting the surviving node - that releases the lock, and the
resource manager on the fenced node unblocks and starts just fine.
Unfortunately, once you reboot that node the situation may reverse (the resource manager may then hang on the rebooted node).
On Sun, 2007-11-11 at 23:57 +0100, Jos Vos wrote:
> Hi,
>
> I have a node that has an unkillable (kill -9 doesn't work) clurgmgrd
> running. I have fenced it now for the third time, with the same
> result after startup...
>
> Stracing clustat gives:
>
> [...]
> socket(PF_FILE, SOCK_STREAM, 0) = 5
> connect(5, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"}, 110) = -1 ENOENT (No such file or directory)
> close(5) = 0
> dup(2) = 5
> fcntl(5, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
> fstat(5, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaac000
> lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
> write(5, "msg_open: No such file or direct"..., 36msg_open: No such file or directory
> ) = 36
> close(5) = 0
> munmap(0x2aaaaaaac000, 4096) = 0
> [...]
>
> How to get this node back up again???
>
> This is on a RHEL 5.0 clone.
If it's unkillable, it's stuck waiting on the kernel for something.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > foo.out
reply + attach foo.out ;)
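Regarding the strace of clustat quoted above: the ENOENT simply means clurgmgrd's listening socket is missing, so clustat's msg_open() has nothing to connect to. A quick standalone check along those lines (my own sketch, not clustat itself; the socket path is just the one visible in the trace):

/* Try to connect to rgmanager's local socket, as the strace shows clustat
 * doing.  ENOENT here is why clustat prints
 * "msg_open: No such file or directory". */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define RGMGR_SOCK "/var/run/cluster/rgmanager.sk"  /* path from the trace */

int main(void)
{
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, RGMGR_SOCK, sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                printf("connect(%s): %s\n", RGMGR_SOCK, strerror(errno));
        else
                printf("rgmanager is answering on %s\n", RGMGR_SOCK);

        close(fd);
        return 0;
}

If this still reports ENOENT while clurgmgrd is sitting in the process table, the daemon is stuck without (re)creating its socket, which fits the "unkillable, waiting in the kernel" picture.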
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster