Re: shutdown seems to get hung up quite frequently

Jan Friesse <jfriesse@xxxxxxxxxx> · Wed, 27 Feb 2013 16:15:10 +0100

Thanks for the patch!

Ack + I've pushed it.

Regards,
  Honza

jason wrote:
> Hi Jan,
> 
> Here is my patch against corosync-1.4.5.
> 
> diff -ruNp corosync-1.4.5-orig/exec/main.c corosync-1.4.5/exec/main.c
> --- corosync-1.4.5-orig/exec/main.c     2012-12-12 18:47:52.000000000 +0800
> +++ corosync-1.4.5/exec/main.c  2013-02-26 20:48:48.937500000 +0800
> @@ -1620,6 +1620,14 @@ int main (int argc, char **argv, char **
>         log_printf (LOGSYS_LEVEL_NOTICE, "Corosync Cluster Engine ('%s'): started and ready to provide service.\n", VERSION);
>         log_printf (LOGSYS_LEVEL_INFO, "Corosync built-in features:" PACKAGE_FEATURES "\n");
> 
> +       /*
> +        * Create exit semaphore.
> +        */
> +       res = sem_init (&corosync_exit_sem, 0, 0);
> +       if (res != 0) {
> +               log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit semaphore.\n");
> +               corosync_exit_error (AIS_DONE_FATAL_ERR);
> +       }
> 
>         (void)signal (SIGINT, sigintr_handler);
>         (void)signal (SIGUSR2, sigusr2_handler);
> @@ -1803,14 +1811,8 @@ int main (int argc, char **argv, char **
>  // TODO what is this hack for? usleep(totem_config.token_timeout * 2000);
> 
>         /*
> -        * Create semaphore and start "exit" thread
> +        * Start "exit" thread
>          */
> -       res = sem_init (&corosync_exit_sem, 0, 0);
> -       if (res != 0) {
> -               log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit thread.\n");
> -               corosync_exit_error (AIS_DONE_FATAL_ERR);
> -       }
> -
>         res = pthread_create (&corosync_exit_thread, NULL, corosync_exit_thread_handler, NULL);
>         if (res != 0) {
>                 log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit thread.\n");
> 
> 
> On Mon, Feb 25, 2013 at 5:14 PM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
>> jason wrote:
>>> Hi Steven,
>>>
>>> Do you have a plan to port the new shutdown method in corosync-2.x back to
>>> corosync-1.4.x? When using corosync-1.4.5, we found that shutting down
>>
>> It's almost impossible. Actually, the shutdown sequence itself didn't
>> change much. What did change is the usage of threads (or, more
>> correctly, there are no threads in 2.x).
>>
>>> corosync by using kill -3 failed several times. The latest failure happened
>>> because, when kill -3 was issued, corosync_exit_sem had not yet been
>>> initialized by sem_init(), so sem_post() in corosync_shutdown_request()
>>> failed to wake corosync_exit_thread_handler(). The fix, I think, is simply
>>> to call sem_init() before we install the signal handlers. But as you say, if
>>
>> Can you send a patch?
>>
>>> corosync-2.x has a stronger shutdown mechanism, why not port it back
>>> to 1.4.x?
>>> On Feb 15, 2013 7:03 AM, "Steven Dake" <steven.dake@xxxxxxxxx> wrote:
>>>
>>
>> Honza
>>
>>>>
>>>>
>>>> On Thu, Feb 14, 2013 at 1:23 PM, Brian J. Murrell <
>>>> brian.murrell@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>>> On EL6, at least, trying to stop corosync (kill -TERM) seems to fail
>>>>> quite frequently with corosync seemingly just not wanting to take heed
>>>>> of the signal and exit.  corosync-cfgtool -H doesn't seem to work either
>>>>> and I just end up killing it with a SIGKILL.
>>>>>
>>>> Shutdown has been a never-ending source of frustration for corosync, now
>>>> solved with the 2.x series :)
>>>>
>>>> The reason the TERM is not honored immediately is that Corosync wants to
>>>> shut down in an orderly fashion on a TERM by quiescing services and
>>>> shutting down cleanly with no pending messages.  Sometimes this is not
>>>> possible quickly because the network is flaky or blocked in some way (such
>>>> as iptables).
>>>>
>>>> I had thought we had sorted all this out for 1.4 series though, so if you
>>>> could provide more information on your corosync rpm version, that might be
>>>> helpful.
>>>>
>>>>
>>>>
>>>>> Is a SIGKILL really the only way to deal with this problem?  Should this
>>>>> need be codified into the initscript?  i.e. try SIGTERM and then SIGKILL
>>>>> after a timeout?  What's a reasonable timeout for SIGTERM to have
>>>>> worked?
>>>>>
>>>>>
>>>> SIGTERM should be honored by the corosync process rather than hacking
>>>> around with a SIGKILL.
>>>>
>>>>
>>>>> Cheers,
>>>>> b.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list
>>>>> discuss@xxxxxxxxxxxx
>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 
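The ordering problem discussed in this thread can be sketched in a small standalone program. This is only an illustration under stated assumptions, not the corosync code: exit_sem, on_sigquit, exit_thread_main and run_shutdown_demo are hypothetical names standing in for corosync_exit_sem, the signal handler, corosync_exit_thread_handler() and main(), and SIGQUIT is assumed to be the signal sent by "kill -3". The point it shows is that once the semaphore is initialized before the handler is installed, a signal that arrives before the exit thread exists is still remembered in the semaphore count.

```c
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

/* Hypothetical stand-in for corosync_exit_sem. */
static sem_t exit_sem;
/* Set by the exit thread once it has been woken. */
static int shutdown_done;

/* Signal handler: sem_post() is async-signal-safe per POSIX. */
static void on_sigquit(int sig)
{
    (void)sig;
    sem_post(&exit_sem);
}

/* Exit thread: blocks until the handler posts, then records the shutdown. */
static void *exit_thread_main(void *arg)
{
    (void)arg;
    while (sem_wait(&exit_sem) != 0) {
        /* Retry if sem_wait() was interrupted (EINTR). */
    }
    shutdown_done = 1;
    return NULL;
}

/* Returns 1 if an early signal was honored, 0 if not, -1 on setup error. */
int run_shutdown_demo(void)
{
    pthread_t tid;

    shutdown_done = 0;

    /* 1. Initialize the semaphore first ... */
    if (sem_init(&exit_sem, 0, 0) != 0)
        return -1;

    /* 2. ... and only then install the signal handler. */
    if (signal(SIGQUIT, on_sigquit) == SIG_ERR)
        return -1;

    /* Simulate "kill -3" arriving before the exit thread has been started. */
    raise(SIGQUIT);

    /* 3. Start the exit thread; the earlier post is still pending. */
    if (pthread_create(&tid, NULL, exit_thread_main, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);

    signal(SIGQUIT, SIG_DFL);
    sem_destroy(&exit_sem);
    return shutdown_done;
}
```

Calling run_shutdown_demo() from a main() and checking for a return value of 1 exercises the early-signal case. With steps 1 and 2 swapped, the handler could call sem_post() on an uninitialized semaphore, which is the situation described above for 1.4.5.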
