Jason,

actually, the shutdown process works in the following way:

- stop accepting new IPC connections
- shut down services in order -> kill clients of the given service
- final shutdown

When the token is lost during shutdown, corosync tries to create a new
membership. It can happen that in the middle of recovery (or gather) the
token is lost again, and this can result in a pretty long time between the
start of shutdown and the actual shutdown.

So, first question: what is your token timeout?

jason wrote:
> Update again,
> I checked the log twice and found that there was only one node in the
> configuration, so a mcast message sent from the node was not sent to the
> nic but just immediately added into regular_sort_queue according to
> orf_token_mcast(). But it seems the node could not get the orf token, so
> it never got a chance to deliver them. After the new confchg arrived, I
> am sure those old messages were then delivered, because I saw that logs
> belonging to the old configuration did print out after the new confchg
> was created.
>
> It seems the problem is that the old configuration could not receive or
> process the orf token, which resulted in corosync not being able to stop
> and messages not being delivered, I guess. But I see no token timeout log
> when it

I'm really sorry, but I was not able to parse the ^^^ sentence.

> was happening. Actually, no log came out at all during the time that I
> was trying to kill corosync.

Logging in this area of the code is pretty bad.

Are you able to reproduce this situation reliably?

Regards,
  Honza

> On Nov 26, 2012 11:07 AM, "jason" <huzhijiang@xxxxxxxxx> wrote:
>
>> Update.
>> According to the AMF log about a timeout, I can confirm that the node
>> which had this issue could not receive a mcast message, even one sent by
>> itself, at that time. But I do not understand why it could receive a
>> JOIN message, which resulted in pause detection.
>> On 2012-11-25 9:39 PM, "jason" <huzhijiang@xxxxxxxxx> wrote:
>>
>>> Hi All,
>>> I currently encountered a problem with corosync-1.4.4: kill -TERM does
>>> not stop the corosync daemon. What I can confirm is:
>>> 1) The thread running corosync_exit_thread_handler() is done and has
>>> disappeared (confirmed with gdb "info threads"). So the hooks into
>>> sched_work() which get fired on token_send may not have had a chance
>>> to run (no token to send?).
>>> 2) I did not have a firewall running when this occurred.
>>> 3) There was no consensus timeout log before this problem happened.
>>> 4) I ran gdb to attach to corosync, wasted some seconds, and when I
>>> continued running it, I saw the pause detection timer trigger (by
>>> checking the log). After about 20 seconds, through the log, I saw both
>>> the new confchg and the service unload happen simultaneously, and
>>> finally corosync exited normally. I think it is the new token created
>>> by the new ring that finally makes corosync exit, but I cannot tell
>>> whether the creation of the new ring was influenced by my running gdb
>>> or not.
>>>
>>> This issue has not been reproduced yet, but I am trying to. Could you
>>> help me take a look into this issue, please?
>>>
>>> Many thanks!

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
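[Editor's note] For reference, the token timeout Honza asks about above is set in the totem section of /etc/corosync/corosync.conf. The values below are illustrative (roughly the corosync 1.x documented defaults), not a recommendation for this cluster:

```
totem {
    version: 2

    # time (ms) to wait for the token before declaring it lost
    # and starting a new membership protocol round
    token: 1000

    # time (ms) to wait for consensus to be achieved before
    # starting a new round of membership configuration
    consensus: 1200
}
```

A very short token/consensus timeout makes repeated token loss during shutdown (and thus the long gather/recovery loops described above) more likely.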