Re: shutdown seems to get hung up quite frequently

Hi Jan,

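Here is my patch against corosync-1.4.5. It moves the sem_init() call for corosync_exit_sem ahead of the signal handler installation, so a shutdown signal that arrives before the exit thread starts can no longer call sem_post() on an uninitialized semaphore.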

diff -ruNp corosync-1.4.5-orig/exec/main.c corosync-1.4.5/exec/main.c
--- corosync-1.4.5-orig/exec/main.c     2012-12-12 18:47:52.000000000 +0800
+++ corosync-1.4.5/exec/main.c  2013-02-26 20:48:48.937500000 +0800
@@ -1620,6 +1620,14 @@ int main (int argc, char **argv, char **
        log_printf (LOGSYS_LEVEL_NOTICE, "Corosync Cluster Engine ('%s'): started and ready to provide service.\n", VERSION);
        log_printf (LOGSYS_LEVEL_INFO, "Corosync built-in features:" PACKAGE_FEATURES "\n");

+       /*
+        * Create exit semaphore.
+        */
+       res = sem_init (&corosync_exit_sem, 0, 0);
+       if (res != 0) {
+               log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit semaphore.\n");
+               corosync_exit_error (AIS_DONE_FATAL_ERR);
+       }

        (void)signal (SIGINT, sigintr_handler);
        (void)signal (SIGUSR2, sigusr2_handler);
@@ -1803,14 +1811,8 @@ int main (int argc, char **argv, char **
 // TODO what is this hack for? usleep(totem_config.token_timeout * 2000);

        /*
-        * Create semaphore and start "exit" thread
+        * Start "exit" thread
         */
-       res = sem_init (&corosync_exit_sem, 0, 0);
-       if (res != 0) {
-               log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit thread.\n");
-               corosync_exit_error (AIS_DONE_FATAL_ERR);
-       }
-
        res = pthread_create (&corosync_exit_thread, NULL, corosync_exit_thread_handler, NULL);
        if (res != 0) {
                log_printf (LOGSYS_LEVEL_ERROR, "Corosync Executive couldn't create exit thread.\n");
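
The ordering problem the patch addresses can be illustrated outside corosync. Below is a minimal, standalone POSIX sketch, not corosync code: exit_sem, shutdown_signal_handler, exit_thread_handler and shutdown_demo are hypothetical names. It assumes, as described further down in this thread, a signal handler that calls sem_post() and an exit thread that waits on the same semaphore. Because sem_init() runs before the handler is installed, a signal delivered before the exit thread exists is simply recorded in the semaphore count instead of hitting an uninitialized semaphore.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for corosync_exit_sem and friends (not corosync code). */
static sem_t exit_sem;
static volatile sig_atomic_t shutdown_requested = 0;

/* Signal handler: sem_post() is async-signal-safe, but it is only
 * well-defined on a semaphore that sem_init() has already set up. */
static void shutdown_signal_handler(int sig)
{
    (void)sig;
    shutdown_requested = 1;
    sem_post(&exit_sem);
}

/* Exit thread: blocks until a shutdown request has posted the semaphore. */
static void *exit_thread_handler(void *arg)
{
    (void)arg;
    /* Retry if sem_wait() is interrupted by a signal (EINTR). */
    while (sem_wait(&exit_sem) != 0 && errno == EINTR) {
        continue;
    }
    /* An orderly shutdown (quiescing services, flushing messages) would go here. */
    return NULL;
}

/* Returns 0 if a SIGINT raised before the exit thread existed still woke it. */
int shutdown_demo(void)
{
    struct sigaction sa, old_sa;
    pthread_t exit_thread;
    int res;

    /* 1. Initialize the semaphore BEFORE any handler can post to it. */
    if (sem_init(&exit_sem, 0, 0) != 0) {
        perror("sem_init");
        return -1;
    }

    /* 2. Only now install the handler that calls sem_post(). */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = shutdown_signal_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, &old_sa) != 0) {
        perror("sigaction");
        sem_destroy(&exit_sem);
        return -1;
    }

    /* 3. Simulate an early shutdown request: the post is kept in the
     *    semaphore's count even though no thread is waiting yet. */
    raise(SIGINT);

    /* 4. Start the exit thread; its sem_wait() consumes the earlier post. */
    res = pthread_create(&exit_thread, NULL, exit_thread_handler, NULL);
    if (res != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(res));
    } else {
        pthread_join(exit_thread, NULL);
    }

    /* Restore the previous SIGINT disposition and release the semaphore. */
    sigaction(SIGINT, &old_sa, NULL);
    sem_destroy(&exit_sem);
    return (res == 0 && shutdown_requested) ? 0 : -1;
}
```

Call shutdown_demo() from main() and link with -pthread; it returns 0 once the early SIGINT has woken the exit thread. Swapping steps 1 and 2 reproduces the bug pattern from this thread: the handler would then post to a semaphore that has not been initialized, which is undefined behavior.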


On Mon, Feb 25, 2013 at 5:14 PM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
> jason wrote:
>> Hi Steven,
>>
>> Do you plan to port the new shutdown method in corosync-2.x back to
>> corosync-1.4.x? When using corosync-1.4.5, we found that shutting down
>
> It's almost impossible. Actually, the shutdown sequence itself didn't
> change much. What did change is the use of threads (or, more precisely,
> the absence of threads in 2.x).
>
>> corosync with kill -3 failed several times. The latest failure happened
>> because, when kill -3 was issued, corosync_exit_sem had not yet been
>> initialized by sem_init(), so sem_post() in corosync_shutdown_request()
>> failed to wake up corosync_exit_thread_handler(). I think the fix is
>> simply to call sem_init() before we install the signal handlers. But as
>> you say, if
>
> Can you send a patch?
>
>> corosync-2.x has a much stronger shutdown mechanism, why not port it
>> back to 1.4.x?
>> On Feb 15, 2013 7:03 AM, "Steven Dake" <steven.dake@xxxxxxxxx> wrote:
>>
>
> Honza
>
>>>
>>>
>>> On Thu, Feb 14, 2013 at 1:23 PM, Brian J. Murrell <
>>> brian.murrell@xxxxxxxxxxxxxxx> wrote:
>>>
>>>> On EL6, at least, trying to stop corosync (kill -TERM) seems to fail
>>>> quite frequently with corosync seemingly just not wanting to take heed
>>>> of the signal and exit.  corosync-cfgtool -H doesn't seem to work either
>>>> and I just end up killing it with a SIGKILL.
>>>>
>>> Shutdown has been a never-ending source of frustration for corosync, now
>>> solved with the 2.x series :)
>>>
>>> The reason the TERM is not honored immediately is that Corosync wants to
>>> shut down in an orderly fashion on a TERM by quiescing services and
>>> shutting down cleanly with no pending messages.  Sometimes this is not
>>> possible quickly because the network is flaky or blocked in some way (such
>>> as iptables).
>>>
>>> I had thought we had sorted all this out for the 1.4 series, though, so
>>> if you could provide more information on your corosync RPM version, that
>>> might be helpful.
>>>
>>>
>>>
>>>> Is a SIGKILL really the only way to deal with this problem?  Should this
>>>> need be codified into the initscript?  i.e. try SIGTERM and then SIGKILL
>>>> after a timeout?  What's a reasonable timeout for SIGTERM to have
>>>> worked?
>>>>
>>>>
>>> SIGTERM should be honored by the corosync process, rather than hacking
>>> around it with a SIGKILL.
>>>
>>>
>>>> Cheers,
>>>> b.
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> discuss mailing list
>>>> discuss@xxxxxxxxxxxx
>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>>
>>>
>>>
>>
>



-- 
Yours,
Jason