Re: kill -TERM does not stop corosync daemon

On Nov 27, 2012 3:28 PM, "Jan Friesse" <jfriesse@xxxxxxxxxx> wrote:
>
> Jason,
> actually, the shutdown process works in the following way:
> - stop accepting new IPC connections
> - shutdown services in order -> kill clients of given service
> - final shutdown
>
> When the token is lost during shutdown, corosync tries to create a
> new membership. It can happen that in the middle of recovery (or
> gather) the token is lost again, and this can result in a pretty long time
> between the start of shutdown and the actual shutdown.
>
> So first question. What is your token timeout?

Hi Jan, thanks for the reply. My token timeout is the default value: 1 sec.

Actually, I did not change any timing-related configuration values.
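
A rough way to picture the shutdown order quoted above is the toy model below. It is only a sketch, not corosync code: `simulate_shutdown`, the service names and the "one step per token" rule are all assumptions made for illustration, based on the idea that the queued shutdown work only runs when a token is available.

```python
from collections import deque


def simulate_shutdown(services, token_events):
    """Toy model of the shutdown order (hypothetical, not corosync code).

    services: service names in unload order (made-up names).
    token_events: iterable of booleans, True = a token arrived this round,
                  False = the token was lost and membership must re-form.
    Returns (log, rounds_used).
    """
    # Assumption: refusing new IPC connections does not need a token.
    log = ["stop accepting new IPC connections"]

    # Assumption: every remaining step needs one token round to run.
    steps = deque(f"unload service: {name}" for name in services)
    steps.append("final shutdown")

    rounds = 0
    for arrived in token_events:
        if not steps:
            break
        rounds += 1
        if arrived:
            log.append(steps.popleft())
        else:
            log.append("token lost, forming new membership")
    return log, rounds


if __name__ == "__main__":
    log, rounds = simulate_shutdown(["svc_a", "svc_b"],
                                    [True, False, False, True, True])
    for entry in log:
        print(entry)
    print("rounds used:", rounds)
```

In this model, if no token ever arrives, "final shutdown" is never reached, which would look like the daemon ignoring kill -TERM. Each lost token only adds another round before the next step can run.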

>
> jason wrote:
> > Update again,
> > I checked the log twice and found that there was only one node in the
> > configuration, so mcast messages sent from the node were not sent to the NIC but
> > were just immediately added to regular_sort_queue according to
> > orf_token_mcast(). However, it seems the node could not get the orf token, so it had no chance to
> > deliver them. After the new confchg arrived, I am sure those old
> > messages were then delivered, because I saw logs belonging to the old configuration
> > printed out after the new confchg was created.
> >
> > It seems the problem is that the old configure can not received or process
> > org token which results in corosync can be stopped and message can not be
> > delivered I guess. But I see no token timeout log when it
>
> I'm really sorry, but I was not able to decrypt ^^^ sentence.
>
> > happening. Actually, no log came out at all during the time that I
> > was trying to kill corosync.
>
> Logging in this area of code is pretty bad.
>
> Are you able to reproduce this situation reliably?

Sorry, although I am trying to, I still cannot reproduce it. It seems the stuck state was destroyed by my gdb debugging, but I am not sure. What I did in gdb was just:

1) Attached to corosync.
2) Ran c, then I saw gdb print "Program received signal SIGUSR1...".
3) So I told gdb to ignore this signal by executing "handle SIGUSR1 noprint".
4) Ran c again.
5) gdb immediately printed the following lines:
Continuing.
[Thread 1082595648 (zombie) exited]

Program exited normally.

Then I found some new log entries that were generated during the time I spent in gdb:

Nov 23 19:58:25 corosync [TOTEM] Process pause detected for 18762 ms, flushing membership messages.
Nov 23 19:58:25 corosync [SERV] Unloading all Corosync service engines.
Nov 23 19:58:45 corosync [SERV] Service engine unloaded: corosync extended virtual synchrony service.
Nov 23 19:58:45 corosync [CLM] CLM CONFIGURATION CHANGE
Nov 23 19:58:45 corosync [CLM] New Configuration:
Nov 23 19:58:45 corosync [CLM]           r(0)   ip (128.0.31.1)
Nov 23 19:58:45 corosync [CLM] Members Left:
Nov 23 19:58:45 corosync [CLM] Members Joined:
...
From the subsequent log I am sure that some mcast messages from the previous configuration were delivered in this new configuration.
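
The timing in the log excerpt above can be double-checked with a short script. This is just a sketch: the helper names `parse_line` and `gap_seconds` are made up here, and the year 2012 is an assumption taken from the dates of this thread, since the syslog-style timestamps do not carry a year.

```python
from datetime import datetime

# Log lines copied from the excerpt above.
LOG = """\
Nov 23 19:58:25 corosync [TOTEM] Process pause detected for 18762 ms, flushing membership messages.
Nov 23 19:58:25 corosync [SERV] Unloading all Corosync service engines.
Nov 23 19:58:45 corosync [SERV] Service engine unloaded: corosync extended virtual synchrony service.
Nov 23 19:58:45 corosync [CLM] CLM CONFIGURATION CHANGE
"""


def parse_line(line, year=2012):
    """Split one log line into (timestamp, subsystem, message)."""
    stamp, rest = line[:15], line[16:]  # "Nov 23 19:58:25" is 15 characters
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    # rest looks like: corosync [SERV] message text
    _, _, tagged = rest.partition("[")
    subsystem, _, message = tagged.partition("] ")
    return when, subsystem, message


def gap_seconds(log_text, start_marker, end_marker):
    """Seconds between the first line containing each marker."""
    start = end = None
    for line in log_text.splitlines():
        when, _, message = parse_line(line)
        if start is None and start_marker in message:
            start = when
        elif end is None and end_marker in message:
            end = when
    if start is None or end is None:
        raise ValueError("marker not found in log")
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = gap_seconds(LOG, "Unloading all", "Service engine unloaded")
    print(f"service unload took {gap:.0f} s")
```

For the excerpt above this gives a 20-second gap between the start of the service unload and the first "Service engine unloaded" line, which matches the "about 20 seconds" I observed.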

>
> Regards,
>   Honza
>
> > On Nov 26, 2012 11:07 AM, "jason" <huzhijiang@xxxxxxxxx> wrote:
> >
> >> Update.
> >> According to the AMF log about a timeout, I can confirm that the node
> >> which had this issue could not receive mcast messages, even ones sent by itself, at
> >> that time. But I do not understand why it could receive the JOIN message which
> >> resulted in pause detection.
> >> On 2012-11-25 at 9:39 PM, "jason" <huzhijiang@xxxxxxxxx> wrote:
> >>
> >>> Hi All,
> >>> I recently encountered a problem with corosync-1.4.4: kill -TERM does
> >>> not stop the corosync daemon. What I can confirm is:
> >>> 1) The corosync_exit_thread_handler() thread has finished and disappeared
> >>> (confirmed with gdb info threads). So the hooks in sched_work(), which
> >>> get fired on token_send, may not have had a chance to run (no token to send?).
> >>> 2) I did not have a firewall running when this occurred.
> >>> 3) There was no consensus timeout log before this problem happened.
> >>> 4) I ran gdb to attach to corosync, spent some seconds there, and when I
> >>> let it continue, I saw the pause detection timer triggered (by checking the log), and
> >>> after about 20 seconds, I saw in the log that both the new confchg and the service
> >>> unload happened simultaneously, and finally corosync exited normally. I
> >>> think it was the new token created by the new ring that made corosync exit
> >>> in the end, but I cannot tell whether the creation of the new ring was influenced by my
> >>> running gdb or not.
> >>>
> >>> This issue has not been reproduced yet, but I am trying to. Could you please help me
> >>> look into this issue?
> >>>
> >>> Many thanks!
> >>>
> >>>
> >
> >
> >
> > _______________________________________________
> > discuss mailing list
> > discuss@xxxxxxxxxxxx
> > http://lists.corosync.org/mailman/listinfo/discuss
>

