Re: shutdown seems to get hung up quite frequently

Jan Friesse <jfriesse@xxxxxxxxxx> · Mon, 25 Feb 2013 10:14:45 +0100

jason wrote:
> Hi Steven,
> 
> Do you have a plan to port the new shutdown method in corosync-2.x back to
> corosync-1.4.x? When using corosync-1.4.5, we encountered shutdown

It's almost impossible. Actually, the shutdown sequence itself didn't
change much. What did change is the use of threads (or, more precisely,
2.x uses no threads).

> corosync by using kill -3 failed several times. The latest one is because
> when issuing kill -3, corosync_exit_sem had not been initialized by
> sem_init(), so sem_post() in corosync_shutdown_request() failed to trigger
> corosync_exit_thread_handler() to work. The resolution I think is simply to
> call the sem_init() before we install signal handler. But as you say, if

Can you send a patch?

> corosync-2.x has a stronger mechanism for shutdown, why not port it back
> to 1.4.x?
> On Feb 15, 2013 7:03 AM, "Steven Dake" <steven.dake@xxxxxxxxx> wrote:
> 

Honza

>>
>>
>> On Thu, Feb 14, 2013 at 1:23 PM, Brian J. Murrell <
>> brian.murrell@xxxxxxxxxxxxxxx> wrote:
>>
>>> On EL6, at least, trying to stop corosync (kill -TERM) seems to fail
>>> quite frequently with corosync seemingly just not wanting to take heed
>>> of the signal and exit.  corosync-cfgtool -H doesn't seem to work either
>>> and I just end up killing it with a SIGKILL.
>>>
>> Shutdown has been a never-ending source of frustration for corosync, now
>> solved with the 2.x series :)
>>
>> The reason the TERM is not honored immediately is that Corosync wants to
>> shut down in an orderly fashion on a TERM by quiescing services and
>> shutting down cleanly with no pending messages.  Sometimes this is not
>> possible quickly because the network is flaky or blocked in some way (such
>> as iptables).
>>
>> I had thought we had sorted all this out for 1.4 series though, so if you
>> could provide more information on your corosync rpm version, that might be
>> helpful.
>>
>>
>>
>>> Is a SIGKILL really the only way to deal with this problem?  Should this
>>> need be codified into the initscript?  i.e. try SIGTERM and then SIGKILL
>>> after a timeout?  What's a reasonable timeout for SIGTERM to have
>>> worked?
>>>
>>>
>> SIGTERM should be honored by the corosync process rather than hacking
>> around with a SIGKILL.
>>
>>
>>> Cheers,
>>> b.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>