Re: kill -TERM does not stop corosync daemon

jason <huzhijiang@xxxxxxxxx> · Tue, 27 Nov 2012 10:31:24 +0800

Update again,

I checked the log twice and found that there was only one node in the configuration, so mcast messages sent from the node were not sent to the NIC but were immediately added to regular_sort_queue, according to orf_token_mcast(). However, it seems the node could not get the ORF token and so never got a chance to deliver them. After the new confchg arrived, I am sure those old messages were then delivered, because I saw log entries belonging to the old configuration printed after the new confchg was created.

It seems the problem is that the old configuration could not receive or process the ORF token, which I guess is why corosync could not be stopped and messages could not be delivered. But I saw no token timeout log while it was happening. Actually, no log output appeared at all during the time I was trying to kill corosync.
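To make the hypothesis concrete, here is a toy model of it. This is only a sketch and not corosync code: the class ToyRing and its methods are hypothetical names, and the only assumption it encodes is that both message delivery and the exit work are driven by token arrival.

```python
from collections import deque


class ToyRing:
    """Toy model only (hypothetical, not corosync code)."""

    def __init__(self):
        self.pending = deque()   # stands in for regular_sort_queue
        self.delivered = []
        self.exit_requested = False
        self.exited = False

    def mcast(self, msg):
        # Single-node ring: nothing goes to the NIC, the message is only queued.
        self.pending.append(msg)

    def request_exit(self):
        # Stands in for kill -TERM: it only schedules work for the next token.
        self.exit_requested = True

    def on_token(self):
        # Assumption of this model: all progress happens when a token arrives.
        while self.pending:
            self.delivered.append(self.pending.popleft())
        if self.exit_requested:
            self.exited = True


ring = ToyRing()
ring.mcast("m1")
ring.mcast("m2")
ring.request_exit()
print(ring.delivered, ring.exited)  # [] False  (no token yet)
ring.on_token()                     # e.g. the token of a newly formed ring
print(ring.delivered, ring.exited)  # ['m1', 'm2'] True
```

If the model is right, it would explain why the old messages were delivered and corosync exited only after the new ring was formed.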

On Nov 26, 2012 11:07 AM, "jason" <huzhijiang@xxxxxxxxx> wrote:

Update.

According to the AMF log about a timeout, I can confirm that the node which had this issue could not receive mcast messages at that time, even those sent by itself. But I do not understand why it could still receive the JOIN message that resulted in pause detection.

On 2012-11-25 9:39 PM, "jason" <huzhijiang@xxxxxxxxx> wrote:

Hi All,

I recently encountered a problem with corosync-1.4.4: kill -TERM does not stop the corosync daemon. What I can confirm:

1) The thread running corosync_exit_thread_handler() had finished and disappeared (confirmed with gdb info threads). So the hooks into sched_work(), which get fired on token_send, may not have had a chance to run (no token to send?).

2) I did not have a firewall running when this occurred.

3) There was no consensus timeout log before this problem happened.

4) I attached gdb to corosync, left it paused for a few seconds, and when I continued it, I saw in the log that the pause detection timer had triggered. About 20 seconds later, the log showed a new confchg and the service unload happening at the same time, and corosync finally exited normally. I think it was the new token created by the new ring that finally let corosync exit, but I cannot tell whether the creation of the new ring was influenced by my running gdb.

I have not been able to reproduce this issue yet, but I am trying to. Could you please help me look into it?
Many thanks!
