Re: shutdown seems to get hung up quite frequently

jason <huzhijiang@xxxxxxxxx> · Thu, 21 Feb 2013 12:17:31 +0800

Hi Andrew,
No, I just run openais-1.1.4 and corosync-1.4.4, and I start/stop the corosync daemon frequently using a shell script.

On Feb 21, 2013 11:57 AM, "Andrew Beekhof" <andrew@xxxxxxxxxxx> wrote:

Are you using corosync with pacemaker when this happens?

On Thu, Feb 21, 2013 at 2:27 PM, jason <huzhijiang@xxxxxxxxx> wrote:

> Hi Steven,

>

> Do you have a plan to port the new shutdown method in corosync-2.x back to

> corosync-1.4.x? When using corosync-1.4.5, shutting down corosync with

> kill -3 has failed several times. The latest failure happened because, when

> kill -3 was issued, corosync_exit_sem had not yet been initialized by sem_init(),

> so sem_post() in corosync_shutdown_request() failed to trigger

> corosync_exit_thread_handler(). I think the fix is simply to

> call sem_init() before we install the signal handler. But as you say, if

> corosync-2.x has a stronger shutdown mechanism, why not port it back

> to 1.4.x?

>

> On Feb 15, 2013 7:03 AM, "Steven Dake" <steven.dake@xxxxxxxxx> wrote:

>>

>>

>>

>> On Thu, Feb 14, 2013 at 1:23 PM, Brian J. Murrell

>> <brian.murrell@xxxxxxxxxxxxxxx> wrote:

>>>

>>> On EL6, at least, trying to stop corosync (kill -TERM) seems to fail

>>> quite frequently with corosync seemingly just not wanting to take heed

>>> of the signal and exit.  corosync-cfgtool -H doesn't seem to work either

>>> and I just end up killing it with a SIGKILL.

>>>

>> Shutdown has been a never-ending source of frustration for corosync, now

>> solved with the 2.x series :)

>>

>> The reason the TERM is not honored immediately is that Corosync wants to

>> shut down in an orderly fashion on a TERM by quiescing services and shutting

>> down cleanly with no pending messages.  Sometimes this is not possible

>> quickly because the network is flaky or blocked in some way (such as

>> iptables).

>>

>> I had thought we had sorted all this out for the 1.4 series though, so if you

>> could provide more information on your corosync rpm version, that might be

>> helpful.

>>

>>

>>>

>>> Is a SIGKILL really the only way to deal with this problem?  Should this

>>> need be codified into the initscript?  i.e. try SIGTERM and then SIGKILL

>>> after a timeout?  What's a reasonable timeout for SIGTERM to have

>>> worked?

>>>

>>

>> SIGTERM should be honored by the corosync process rather than hacking

>> around with a SIGKILL.
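
For reference, the escalation asked about above (SIGTERM, then SIGKILL after a timeout) could look roughly like the C sketch below. It is a fallback illustration only, not something corosync or its initscript is known to ship. stop_with_timeout() and its return convention are hypothetical, and the thread does not say what a reasonable timeout is, so the timeout is left as a parameter.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask a process to stop with SIGTERM, poll for up to
 * timeout_sec seconds, then fall back to SIGKILL.
 * Returns 0 if the process went away after SIGTERM (or was already gone),
 * 1 if SIGKILL had to be sent, -1 on any other error. */
int stop_with_timeout(pid_t pid, int timeout_sec)
{
    const struct timespec step = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int polls = timeout_sec * 10;
    int i;

    if (kill(pid, SIGTERM) != 0)
        return (errno == ESRCH) ? 0 : -1;

    for (i = 0; i < polls; i++) {
        /* Signal 0 sends nothing; it only checks that the PID exists. */
        if (kill(pid, 0) != 0 && errno == ESRCH)
            return 0;
        nanosleep(&step, NULL);
    }

    /* Orderly shutdown did not finish in time: force it. */
    if (kill(pid, SIGKILL) != 0)
        return (errno == ESRCH) ? 0 : -1;
    return 1;
}
```

Two caveats: kill(pid, 0) keeps succeeding for an unreaped zombie child, so the polling suits a daemon that is not a child of the caller, and a PID can in principle be reused between polls. A forced kill also skips the orderly quiescing described earlier in the thread.
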

>>

>>>

>>> Cheers,

>>> b.

>>>

>>>

>>>

>>>

>>> _______________________________________________

>>> discuss mailing list

>>> discuss@xxxxxxxxxxxx

>>> http://lists.corosync.org/mailman/listinfo/discuss
