Re: kill -TERM does not stop corosync daemon

Jan Friesse <jfriesse@xxxxxxxxxx> · Tue, 27 Nov 2012 09:16:01 +0100

Oh,
actually ... debugging corosync with gdb is almost impossible. Because
corosync behaves more like a "real time" system, it depends on many
timeouts, and if you attach gdb (and so stall it for quite a long time),
SIGUSR1 is sent and this totally changes the execution flow.
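
To illustrate why a stall in gdb changes the flow, here is a minimal sketch
of timestamp-based pause detection. It is written in Python and is not
corosync's actual C code: the class name, the simulate() helper, and the
rule "a gap longer than half the token timeout counts as a pause" are all
assumptions made for this example.

```python
import time


class PauseDetector:
    """Toy pause detector; names and threshold rule are illustrative only."""

    def __init__(self, token_timeout_ms, clock=time.monotonic):
        # Assumption: a gap longer than half the token timeout is a pause.
        self.threshold_ms = token_timeout_ms / 2
        self.clock = clock
        self.last_seen = clock()

    def tick(self):
        """Call on every event-loop iteration; return the pause in ms or None."""
        now = self.clock()
        gap_ms = (now - self.last_seen) * 1000.0
        self.last_seen = now
        if gap_ms > self.threshold_ms:
            return gap_ms
        return None


def simulate(gaps_s, token_timeout_ms=1000):
    """Drive the detector with a fake clock; return detected pauses in ms."""
    t = [0.0]
    detector = PauseDetector(token_timeout_ms, clock=lambda: t[0])
    pauses = []
    for gap in gaps_s:
        t[0] += gap  # pretend this much time passed between iterations
        pause = detector.tick()
        if pause is not None:
            pauses.append(round(pause))
    return pauses
```

With a 1 second token timeout, a process that is stopped in a debugger for
many seconds would report a single long pause on its next loop iteration,
while ordinary sub-second gaps would pass unnoticed.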

The only ways to debug corosync are:
- watch the blackbox (corosync-blackbox), debug messages, ...
- postmortem analysis (core dump)
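
For the postmortem route, a small helper can pull backtraces of all threads
out of a core file without touching the live process. This is only a sketch:
the function names are made up for this example, and the binary and core
paths shown in the usage note are assumptions that depend on your system.

```python
import subprocess


def build_backtrace_command(binary, core):
    """Return a gdb argv that prints backtraces of all threads in a core."""
    return [
        "gdb",
        "--batch",                     # run the given commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
        binary,
        core,
    ]


def dump_backtraces(binary, core):
    """Run gdb on the core file and return whatever it printed to stdout."""
    result = subprocess.run(
        build_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,  # do not raise if gdb exits non-zero
    )
    return result.stdout
```

Usage would look like dump_backtraces("/usr/sbin/corosync", "core"), with
both paths adjusted to wherever the binary and the core file actually are.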

Honza

jason wrote:
> On Nov 27, 2012 3:28 PM, "Jan Friesse" <jfriesse@xxxxxxxxxx> wrote:
>>
>> Jason,
>> actually the shutdown process works in the following way:
>> - stop accepting new IPC connections
>> - shut down services in order -> kill clients of the given service
>> - final shutdown
>>
>> When the token is lost during shutdown, corosync tries to create a
>> new membership. It can happen that, in the middle of recovery (or
>> gather), the token is lost again, and this can result in a pretty long
>> time between the start of shutdown and the actual shutdown.
>>
>> So first question. What is your token timeout?
> 
> Hi Jan, thanks for the reply. My token timeout is the default value: 1 sec.
> 
> Actually, I did not change any timing-related configuration values.
> 
>>
>> jason wrote:
>>> Update again,
>>> I checked the log twice and found that there was only one node in the
>>> configuration, so mcast messages sent from the node were not sent to the
>>> NIC but were immediately added to regular_sort_queue according to
>>> orf_token_mcast(). It seems the node could not get the orf token, so it
>>> never got a chance to deliver them. After the new confchg arrived, I am
>>> sure those old messages were delivered, because I saw logs belonging to
>>> the old configuration printed after the new confchg was created.
>>>
>>> It seems the problem is that the old configuration can not receive or
>>> process the orf token, which means corosync can not be stopped and
>>> messages can not be delivered, I guess. But I see no token timeout log
>>> when it
>>
>> I'm really sorry, but I was not able to decrypt ^^^ sentence.
>>
>>> happening. Actually, no log came out at all during the time that I
>>> was trying to kill corosync.
>>
>> Logging in this area of code is pretty bad.
>>
>> Are you able to reproduce this situation reliably?
> 
> Sorry, although I am trying, I still can not reproduce it. It seems the
> situation was destroyed by my gdb debugging, but I am not sure. What I did
> in gdb was just:
> 
> 1) Attach to corosync.
> 2) Run "c"; then I saw gdb print "Program received signal SIGUSR1..."
> 3) So I told gdb to ignore this signal by executing "handle SIGUSR1 noprint".
> 4) Run "c" again.
> 5) gdb immediately printed the following lines:
> Continuing.
> [Thread 1082595648 (zombie) exited]
> 
> Program exited normally.
> 
> Then I found some new log entries generated during the time I spent in gdb:
> 
> Nov 23 19:58:25 corosync [TOTEM] Process pause detected for 18762 ms, flushing membership messages.
> Nov 23 19:58:25 corosync [SERV] Unloading all Corosync service engines.
> Nov 23 19:58:45 corosync [SERV] Service engine unloaded: corosync extended virtual synchrony service.
> Nov 23 19:58:45 corosync [CLM] CLM CONFIGURATION CHANGE
> Nov 23 19:58:45 corosync [CLM] New Configuration:
> Nov 23 19:58:45 corosync [CLM]           r(0)   ip (128.0.31.1)
> Nov 23 19:58:45 corosync [CLM] Members Left:
> Nov 23 19:58:45 corosync [CLM] Members Joined:
> ...
> From the log that follows, I am sure that some mcast messages from the
> previous configuration were delivered in this new configuration.
> 
>>
>> Regards,
>>   Honza
>>
>>> On Nov 26, 2012 11:07 AM, "jason" <huzhijiang@xxxxxxxxx> wrote:
>>>
>>>> Update.
>>>> According to the AMF log about a timeout, I can confirm that the node
>>>> which had this issue could not receive mcast messages, even those sent
>>>> by itself, at that time. But I do not understand why it could receive
>>>> the JOIN message which resulted in pause detection.
>>>> On 2012-11-25 9:39 PM, "jason" <huzhijiang@xxxxxxxxx> wrote:
>>>>
>>>>> Hi All,
>>>>> I have encountered a problem with corosync-1.4.4: kill -TERM does
>>>>> not stop the corosync daemon. What I can confirm is:
>>>>> 1) The corosync_exit_thread_handler() thread has finished and
>>>>> disappeared (confirmed with gdb "info threads"). So the hooks into
>>>>> sched_work(), which get fired on token_send, may not have had a chance
>>>>> to run (no token to send?).
>>>>> 2) I did not have a firewall running when this occurred.
>>>>> 3) There was no consensus timeout log before this problem happened.
>>>>> 4) I ran gdb to attach to corosync, wasted some seconds, and when I
>>>>> continued execution, I saw the pause detection timer trigger (by
>>>>> checking the log). After about 20 seconds, the log showed that both a
>>>>> new confchg and the service unload happened simultaneously, and finally
>>>>> corosync exited normally. I think it was the new token created by the
>>>>> new ring that finally made corosync exit, but I can not tell whether
>>>>> the creation of the new ring was influenced by my running gdb or not.
>>>>>
>>>>> This issue has not been reproduced yet, but I am trying to. Could you
>>>>> please help me look into this issue?
>>>>>
>>>>> Many thanks!
>>>>>
>>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>
> 
