Re: coroipcs_ipc_service_exit() dead loop

jason <huzhijiang@xxxxxxxxx> · Thu, 25 Apr 2013 23:30:25 +0800

Hi Honza,
Thank you for the reply. So far, this dead loop has occurred only once, so I think it must be really hard to reproduce. But I can describe what I was doing when it happened:


1) Register many (50 in my environment) confdb clients, each in its own thread, which call confdb_initialize() and then confdb_track_changes().
2) All confdb clients track changes to the same object.
3) A service (AMF, for example) on the server side (I mean the corosync daemon) changes that object frequently (in my environment, every time it gets a configuration change).
4) After the clients get the change notification, they also make changes to that object.
5) While the object is being changed, restart the corosync daemon frequently (kill -TERM, then start it again).


I have got your patch, and it looks like it works the same way as the workaround I currently use in my environment, doesn't it? Below is my workaround patch:

diff -ruNp corosync-1.4.5-orig/services/confdb.c corosync-1.4.5/services/confdb.c

--- corosync-1.4.5-orig/services/confdb.c       2013-03-14 20:32:00.664972793 +0800
+++ corosync-1.4.5/services/confdb.c    2013-04-25 22:55:53.851233577 +0800
@@ -350,6 +350,10 @@ __attribute__ ((constructor)) static voi

 static int confdb_exec_exit_fn(void)
 {
+       objdb_notify_dispatch(0, /* useless */
+                             notify_pipe[0],
+                             POLLIN,
+                             NULL);
        api->poll_dispatch_delete(api->poll_handle_get(), notify_pipe[0]);
        close(notify_pipe[0]);
        close(notify_pipe[1]);
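
The idea behind the workaround can be sketched with a toy model. This is only an illustration under stated assumptions: the toy_* names and struct toy_conn are hypothetical and are not corosync code. The model assumes that each queued notification holds one reference on the connection, that the dispatch callback drops it, and that the exit loop keeps retrying until the destroy call returns -1. The max_spins cap exists only so the demo terminates.

```c
/* Toy model only: hypothetical stand-ins, not the real corosync types. */
struct toy_conn {
    int refcount; /* references held on behalf of queued notifications */
    int pending;  /* notifications queued but not yet dispatched */
};

/* Models the service side queuing a notification: takes a reference. */
static void toy_notify(struct toy_conn *c)
{
    c->refcount++;
    c->pending++;
}

/* Models the dispatch callback: every handled notification drops its reference. */
static void toy_dispatch_all(struct toy_conn *c)
{
    while (c->pending > 0) {
        c->pending--;
        c->refcount--;
    }
}

/* Models the destroy call: reports -1 only once nothing references the connection. */
static int toy_destroy(const struct toy_conn *c)
{
    return (c->refcount == 0) ? -1 : 0;
}

/*
 * Models the exit loop. If drain_first is non-zero, pending notifications
 * are dispatched before the loop starts (the idea of the workaround).
 * Returns the number of retries, or max_spins if the loop never finished.
 */
static int toy_service_exit(struct toy_conn *c, int drain_first, int max_spins)
{
    int spins = 0;

    if (drain_first)
        toy_dispatch_all(c);

    while (toy_destroy(c) != -1 && spins < max_spins)
        spins++;

    return spins;
}
```

In this model, three calls to toy_notify() followed by toy_service_exit(&c, 0, 1000) hit the cap of 1000 retries, while toy_service_exit(&c, 1, 1000) returns 0 because the references are released before the loop runs.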



Please note that I have not reproduced the dead loop a second time so far, with or without my workaround patch. I will keep trying to reproduce it and test the validity of your patch.


  


On Thu, Apr 25, 2013 at 10:43 PM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:

Jason,

thanks for the analysis. It took me quite a lot of time to understand
what is really happening, but I believe I've got it. I've created the patch
"[PATCH] Free confdb message holder list on confdb exit". Can you please
give it a try and paste the results?

How were you able to hit that bug (I mean, do you have any reproducer)?

Regards,
  Honza



jason wrote:

> Sorry, in the previous mail I didn't realize that
> after service_exit_schedwrk_handler() for confdb is done, the notify_pipe
> is closed, and therefore ipc_dispatch_send_from_poll_thread() won't increase
> conn->refcount. But if the scenario below occurs, the dead loop still has a
> chance to happen:
>
> 1. confdb_notify_lib_of_key_change()/confdb_notify_lib_of_new_object()/...
> ( before objdb_notify_dispatch() )
> 2. service_exit_schedwrk_handler()
> 3. service_unlink_schedwrk_handler() //deadloop!
>
> On Mon, Apr 22, 2013 at 10:29 PM, jason <huzhijiang@xxxxxxxxx> wrote:
>
>> Hi All,
>>
>> I encountered a dead loop in the following code:
>>
>> coroipcs_ipc_service_exit() {
>> ...
>> while (conn_info_destroy (conn_info) != -1)
>>  ;
>> }
>>
>> It happened when the confdb service side was notifying the library side
>> about a key change (or object creation/destruction) while corosync was
>> unloading. When it happened, I saw conn_info->refcount = 3, and it was a
>> confdb IPC connection.
>>
>> By analysing the code, I found that there is a gap
>> between service_exit_schedwrk_handler()
>> and service_unlink_schedwrk_handler(). If the confdb service side calls
>> confdb_notify_lib_of_key_change() in this gap (triggered by some other
>> service), conn_info->refcount will be increased
>> by ipc_dispatch_send_from_poll_thread(). Then, when we are in
>> coroipcs_ipc_service_exit(), the dead loop will happen.
>>
>> Moreover, after service_exit_schedwrk_handler() for confdb is
>> done, objdb_notify_dispatch() is unregistered from poll, so there is no
>> further chance to decrease conn->refcount after this (even if we somehow
>> avoid the dead loop).
>>
>> The above is my conclusion from code analysis only. I don't have any idea
>> how to correct it yet, and I am not even sure it is the root cause of the
>> dead loop. Please help.
>>
>> --
>> Yours,
>> Jason
>>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>






-- 
Yours,
Jason
