Re: kill -TERM does not stop corosync daemon

jason <huzhijiang@xxxxxxxxx> · Tue, 27 Nov 2012 16:08:43 +0800

On Nov 27, 2012 3:28 PM, "Jan Friesse" <jfriesse@xxxxxxxxxx> wrote:

>
> Jason,
> actually shutdown process works in following way:
> - stop accepting new IPC connections
> - shutdown services in order -> kill clients of given service
> - final shutdown
>
> When token is lost in the process of shutdown, corosync tries to create
> new membership. It can happen, that in the middle of recovery (or
> gather), token is lost again and this can result in pretty long time
> between start of shutdown and actual shutdown.
>
> So first question. What is your token timeout?
Hi Jan, thanks for the reply. My token timeout is the default value: 1 second.
Actually, I did not change any timing-related configuration values.

>
> jason wrote:
> > Update again,
> > I checked the log twice and found that there was only one node in the
> > configuration, so mcast messages sent from the node were not sent to the NIC but
> > just immediately added to regular_sort_queue according to
> > orf_token_mcast(), but it seems the node could not get the orf token to get a chance to
> > deliver them. But after the new confchg arrived, I am sure those old
> > messages were then delivered, because I saw logs belonging to the old configuration
> > printed out after the new confchg was created.
> >
> > It seems the problem is that the old configure can not received or process
> > org token which results in corosync can be stopped and message can not be
> > delivered I guess. But I see no token timeout log when it
>
> I'm really sorry, but I was not able to decrypt ^^^ sentence.
>
> > happening. Actually, no log came out at all during the time that I
> > was trying to kill corosync.
>
> Logging in this area of code is pretty bad.
>
> Are you able to reproduce this situation reliably?
Sorry, although I am trying to, I still cannot reproduce it. It seems the state was destroyed by my gdb debugging, but I am not sure. What I did in gdb was just:
1) attach to corosync
2) c, then I saw gdb print "Program received signal SIGUSR1..."
3) So I let gdb ignore this signal by executing "handle SIGUSR1 noprint"
4) c again.
5) gdb immediately printed the following lines:

Continuing.
[Thread 1082595648 (zombie) exited]
Program exited normally.

Then I found some new log entries that were generated during the time I spent in gdb:
Nov 23 19:58:25 corosync [TOTEM] Process pause detected for 18762 ms, flushing membership messages.
Nov 23 19:58:25 corosync [SERV] Unloading all Corosync service engines.
Nov 23 19:58:45 corosync [SERV] Service engine unloaded: corosync extended virtual synchrony service.
Nov 23 19:58:45 corosync [CLM] CLM CONFIGURATION CHANGE
Nov 23 19:58:45 corosync [CLM] New Configuration:
Nov 23 19:58:45 corosync [CLM]           r(0)   ip (128.0.31.1)
Nov 23 19:58:45 corosync [CLM] Members Left:
Nov 23 19:58:45 corosync [CLM] Members Joined:
...

From the log, I am sure that some mcast messages from the previous configuration were delivered in this new configuration.
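
As a rough cross-check of the timings in the log excerpt above, the sketch below parses a few of those lines and computes the reported pause and the gap between the start of service unloading and the first unloaded engine. It is only an illustration: the year 2012 is an assumption taken from the date of this thread (syslog-style timestamps carry no year), and the helper names `parse`, `pause_ms` and `unload_gap_seconds` are hypothetical and not part of corosync.

```python
import re
from datetime import datetime

# Lines copied from the log excerpt in this message.
LOG = """\
Nov 23 19:58:25 corosync [TOTEM] Process pause detected for 18762 ms, flushing membership messages.
Nov 23 19:58:25 corosync [SERV] Unloading all Corosync service engines.
Nov 23 19:58:45 corosync [SERV] Service engine unloaded: corosync extended virtual synchrony service.
Nov 23 19:58:45 corosync [CLM] CLM CONFIGURATION CHANGE
"""

ASSUMED_YEAR = 2012  # assumption: the timestamps omit the year

LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d) corosync \[(\w+)\] (.*)$")
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")


def parse(log_text):
    """Return a list of (timestamp, subsystem, message) tuples."""
    entries = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        stamp = datetime.strptime(f"{ASSUMED_YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S")
        entries.append((stamp, m.group(2), m.group(3)))
    return entries


def pause_ms(entries):
    """Return the pause length reported by TOTEM in milliseconds, or None."""
    for _, subsystem, message in entries:
        m = PAUSE_RE.search(message)
        if subsystem == "TOTEM" and m:
            return int(m.group(1))
    return None


def unload_gap_seconds(entries):
    """Seconds between 'Unloading all ...' and the first 'Service engine unloaded'."""
    start = next(t for t, s, msg in entries
                 if s == "SERV" and msg.startswith("Unloading all"))
    end = next(t for t, s, msg in entries
               if s == "SERV" and msg.startswith("Service engine unloaded"))
    return (end - start).total_seconds()


entries = parse(LOG)
print(pause_ms(entries))            # 18762
print(unload_gap_seconds(entries))  # 20.0
```

For these lines the pause is 18762 ms and the unload gap is 20 seconds, which matches the "about 20 seconds" mentioned in the original report quoted below.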
>
> Regards,
>   Honza
>

> > On Nov 26, 2012 11:07 AM, "jason" <huzhijiang@xxxxxxxxx> wrote:
> >
> >> Update.
> >> According to the AMF log about a timeout, I can confirm that the node
> >> which had this issue could not receive mcast messages, even those sent by itself, at
> >> that time.  But I do not understand why it could receive the JOIN message which
> >> resulted in pause detection.
> >> On 2012-11-25 at 9:39 PM, "jason" <huzhijiang@xxxxxxxxx> wrote:
> >>

> >>> Hi All,
> >>> I recently encountered a problem with corosync-1.4.4: kill -TERM does
> >>> not stop the corosync daemon. What I can confirm is:
> >>> 1)  The corosync_exit_thread_handler() thread is done and has disappeared
> >>> (confirmed with gdb info threads).  So the hooks into sched_work(), which
> >>> get fired on token_send, may not have had a chance to run (no token to send?)
> >>> 2) I did not have a firewall running when this occurred.
> >>> 3) There was no consensus timeout log before this problem happened.
> >>> 4) I ran gdb to attach to corosync, wasted some seconds, and when I
> >>> continued it, I saw the pause detection timer triggered (by checking the log), and
> >>> after about 20 seconds, through the log I saw both the new confchg and the service
> >>> unload happen simultaneously, and finally corosync exited normally. I
> >>> think it was the new token created by the new ring that made corosync exit
> >>> finally, but I cannot tell whether the creation of the new ring was influenced by my
> >>> running of gdb or not.
> >>>
> >>> This issue has not been reproduced, but I am trying to. Could you help me
> >>> take a look into this issue, please?
> >>>
> >>> Many thanks!
> >>>

> >
> > _______________________________________________
> > discuss mailing list
> > discuss@xxxxxxxxxxxx
> > http://lists.corosync.org/mailman/listinfo/discuss
