Re: Interesting deadlock bug found causing three threads to deadlock on the PJSUA, UA, and transaction locks.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi David,

Ah, right, I completely forgot that application can obtain PJSUA lock in the callback, so even though the library doesn't technically cause the deadlock, it's still not a practical solution.

Your initial analysis seems to be correct, it's just that last time, I didn't find a solution yet to prevent the transaction from getting destroyed (as previously reported in ticket #1706 (https://trac.pjsip.org/repos/ticket/1706), but now I think I do (in the attached patch).

So, thanks for testing it before and please help us test the patch again.

Best regards,
Ming

On Tue, Mar 21, 2017 at 4:05 AM, David Richards <david.brian.richards@xxxxxxxxx> wrote:
Hi Ming.

Thanks for the quick response. However your patch doesn't fix the problem, it just moved it as shown in the Thread 1 stack backtrace below.  Your patch stopped the PJSUA lock from being obtained in pjsua_call_on_state_changed()(frame 7), however, just a few frames later (frame 5), my application calls pjsua_call_on_state_changed() from on_call_state() where an attempt on the PJSUA lock is made and deadlocked.

I didn't include the stack back traces of the other tasks because they were essentially blocked the same as before.  I admit I don't know this software well, but why was my initial analysis incorrect?  Don't the locks always have to be locked in the same order?

Thanks for your help,
Dave

Thread 1 (Thread 0x7f1a3935d840 (LWP 27501)):
#0  in __lll_lock_wait () from /lib64/libpthread.so.0
#1  in _L_lock_839 () from /lib64/libpthread.so.0
#2  in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  in pj_mutex_lock (mutex=0x4594238) at ../src/pj/os_core_unix.c:1265
#4  in PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:575
#5  in pjsua_call_get_info (call_id=1091, info=0x7fff7ce7c610) at ../src/pjsua-lib/pjsua_call.c:1817
#6  in on_call_state (call_id=1091, e=0x7fff7ce7ceb0) at TERMmain.c:509
#7  in pjsua_call_on_state_changed (inv=0x7f1a05272688, e=0x7fff7ce7ceb0) at ../src/pjsua-lib/pjsua_call.c:3824
#8  in inv_set_state (inv=0x7f1a05272688, state=PJSIP_INV_STATE_DISCONNECTED, e=0x7fff7ce7ceb0) at ../src/pjsip-ua/sip_inv.c:317
#9  in inv_on_state_incoming (inv=0x7f1a05272688, e=0x7fff7ce7ceb0) at ../src/pjsip-ua/sip_inv.c:4310
#10 in mod_inv_on_tsx_state (tsx=0x7f1a05273608, e=0x7fff7ce7ceb0) at ../src/pjsip-ua/sip_inv.c:717
#11 in pjsip_dlg_on_tsx_state (dlg=0x7f1a05c55598, tsx=0x7f1a05273608, e=0x7fff7ce7ceb0) at ../src/pjsip/sip_dialog.c:2064
#12 in mod_ua_on_tsx_state (tsx=0x7f1a05273608, e=0x7fff7ce7ceb0) at ../src/pjsip/sip_ua_layer.c:178
#13 in tsx_set_state (tsx=0x7f1a05273608, state=PJSIP_TSX_STATE_COMPLETED, event_src_type=PJSIP_EVENT_TX_MSG, event_src=0x7f1a043c0c38, flag=0) at ../src/pjsip/sip_transaction.c:1235
#14 in tsx_on_state_proceeding_uas (tsx=0x7f1a05273608, event=0x7fff7ce7cfb0) at ../src/pjsip/sip_transaction.c:2819
#15 in pjsip_tsx_send_msg (tsx=0x7f1a05273608, tdata=0x7f1a043c0c38) at ../src/pjsip/sip_transaction.c:1751
#16 in pjsip_dlg_send_response (dlg=0x7f1a05c55598, tsx=0x7f1a05273608, tdata=0x7f1a043c0c38) at ../src/pjsip/sip_dialog.c:1529
#17 in pjsip_inv_send_msg (inv=0x7f1a05272688, tdata=0x7f1a043c0c38) at ../src/pjsip-ua/sip_inv.c:3227
#18 in pjsua_call_hangup (call_id=1091, code=606, reason=0x0, msg_data=0x0) at ../src/pjsua-lib/pjsua_call.c:2426
#19 in timer_callback (timer_heap=0x7f1a15cc85f8, timer=0xbf6e90 <dialer_info+3972080>) at TERMmain.c:313
#20 in pj_timer_heap_poll (ht=0x7f1a15cc85f8, next_delay=0x0) at ../src/pj/timer.c:643
#21 in TIMERpoll () at TIMERmain.c:72
#22 in main_func (argc=1, argv=0x7fff7ce7d788) at main.c:352
#23 in pj_run_app (main_func=0x40c6a7 <main_func>, argc=1, argv=0x7fff7ce7d788, flags=0) at ../src/pj/os_core_unix.c:1952
#24 in main (argc=1, argv=0x7fff7ce7d788) at main.c:782


On Mon, Mar 20, 2017 at 5:33 AM, Ming <ming@xxxxxxxxx> wrote:
Hi all,

Thanks for the report and sorry for the delay in answering.

For the problem reported by David, we think the problem is actually caused by thread 1 (instead of thread 3) which shouldn't try to obtain PJSUA's lock after holding the transaction lock. So we propose the fix attached. Please revert your temporary fix (where you remove the acquisition of group lock in pjsip_tsx_layer_find_tsx()) and use our patch instead, and share with us whether it rectifies the issue. Thanks.

While for the deadlock issue reported by Alex, despite the similarity, I believe this is an unrelated issue, so I'll reply it separately in the original thread.

Finally, for Janu, without the stack trace, I can't be certain that your problem is the same as one or both of these issues. So my suggestion is to apply both patches and see if the problem disappears. Also, I would recommend to upgrade to version 2.6, if you haven't, since 2.3 is already quite old.

Best regards,
Ming

On Fri, Mar 17, 2017 at 4:58 PM, janu@xxxxxxxxxxx <janu@xxxxxxxxxxx> wrote:
Also seems like the one I reprted (happens also if the application calls pjsua_call_hangup() between receiving a sip BYE and sending the corresponding OK).
I also got no response to that.

Regards,
Thomas

------ Originalnachricht------
Von: Alex Hermann
Datum: Fr., 17. März 2017 09:49
Cc:
Betreff:Re: Interesting deadlock bug found causing three threads to deadlock on the PJSUA, UA, and transaction locks.

On donderdag 16 maart 2017 14:18:25 CET David Richards wrote:> I found a bug that causes my application (using the PJSUA API) to deadlockThis looks like the same deadlock I reported on 7-3-2017. Unfortunately, I got no response to it.-- Alex Hermann_______________________________________________Visit our blog: http://blog.pjsip.orgpjsip mailing listpjsip@lists.pjsip.orghttp://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org



_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org



_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org


Index: pjsip/include/pjsip/sip_transaction.h
===================================================================
--- pjsip/include/pjsip/sip_transaction.h	(revision 5564)
+++ pjsip/include/pjsip/sip_transaction.h	(working copy)
@@ -179,6 +179,10 @@
  * Find a transaction with the specified key. The transaction key normally
  * is created by calling #pjsip_tsx_create_key() from an incoming message.
  *
+ * IMPORTANT: To prevent deadlock, application should use
+ * #pjsip_tsx_layer_find_tsx2() instead which only adds a reference to
+ * the transaction instead of locking it.
+ *
  * @param key	    The key string to find the transaction.
  * @param lock	    If non-zero, transaction will be locked before the
  *		    function returns, to make sure that it's not deleted
@@ -191,6 +195,21 @@
 						      pj_bool_t lock );
 
 /**
+ * Find a transaction with the specified key. The transaction key normally
+ * is created by calling #pjsip_tsx_create_key() from an incoming message.
+ *
+ * @param key	    The key string to find the transaction.
+ * @param add_ref   If non-zero, transaction's reference will be added
+ *		    by one before the function returns, to make sure that
+ * 		    it's not deleted by other threads.
+ *
+ * @return	    The matching transaction instance, or NULL if transaction
+ *		    can not be found.
+ */
+PJ_DECL(pjsip_transaction*) pjsip_tsx_layer_find_tsx2( const pj_str_t *key,
+						       pj_bool_t add_ref );
+
+/**
  * Create, initialize, and register a new transaction as UAC from the 
  * specified transmit data (\c tdata). The transmit data must have a valid
  * \c Request-Line and \c CSeq header. 
Index: pjsip/src/pjsip/sip_transaction.c
===================================================================
--- pjsip/src/pjsip/sip_transaction.c	(revision 5564)
+++ pjsip/src/pjsip/sip_transaction.c	(working copy)
@@ -641,8 +641,8 @@
 /*
  * Find a transaction.
  */
-PJ_DEF(pjsip_transaction*) pjsip_tsx_layer_find_tsx( const pj_str_t *key,
-						     pj_bool_t lock )
+static pjsip_transaction* find_tsx( const pj_str_t *key, pj_bool_t lock,
+				    pj_bool_t add_ref )
 {
     pjsip_transaction *tsx;
     pj_uint32_t hval = 0;
@@ -654,7 +654,7 @@
     
     /* Prevent the transaction to get deleted before we have chance to lock it.
      */
-    if (tsx && lock)
+    if (tsx)
         pj_grp_lock_add_ref(tsx->grp_lock);
     
     pj_mutex_unlock(mod_tsx_layer.mutex);
@@ -666,9 +666,12 @@
     /* Simulate race condition! */
     PJ_RACE_ME(5);
 
-    if (tsx && lock) {
-	pj_grp_lock_acquire(tsx->grp_lock);
-        pj_grp_lock_dec_ref(tsx->grp_lock);
+    if (tsx) {
+	if (lock)
+	    pj_grp_lock_acquire(tsx->grp_lock);
+
+        if (!add_ref)
+            pj_grp_lock_dec_ref(tsx->grp_lock);
     }
 
     return tsx;
@@ -675,6 +678,20 @@
 }
 
 
+PJ_DEF(pjsip_transaction*) pjsip_tsx_layer_find_tsx( const pj_str_t *key,
+						     pj_bool_t lock )
+{
+    return find_tsx(key, lock, PJ_FALSE);
+}
+
+
+PJ_DEF(pjsip_transaction*) pjsip_tsx_layer_find_tsx2( const pj_str_t *key,
+						      pj_bool_t add_ref )
+{
+    return find_tsx(key, PJ_FALSE, add_ref);
+}
+
+
 /* This module callback is called when module is being loaded by
  * endpoint. It does nothing for this module.
  */
Index: pjsip/src/pjsip/sip_ua_layer.c
===================================================================
--- pjsip/src/pjsip/sip_ua_layer.c	(revision 5564)
+++ pjsip/src/pjsip/sip_ua_layer.c	(working copy)
@@ -551,12 +551,12 @@
 			     pjsip_get_invite_method(), rdata);
 
 	/* Lookup the INVITE transaction */
-	tsx = pjsip_tsx_layer_find_tsx(&key, PJ_TRUE);
+	tsx = pjsip_tsx_layer_find_tsx2(&key, PJ_TRUE);
 
 	/* We should find the dialog attached to the INVITE transaction */
 	if (tsx) {
 	    dlg = (pjsip_dialog*) tsx->mod_data[mod_ua.mod.id];
-	    pj_grp_lock_release(tsx->grp_lock);
+	    pj_grp_lock_dec_ref(tsx->grp_lock);
 
 	    /* Dlg may be NULL on some extreme condition
 	     * (e.g. during debugging where initially there is a dialog)
Index: pjsip/src/pjsip-ua/sip_inv.c
===================================================================
--- pjsip/src/pjsip-ua/sip_inv.c	(revision 5564)
+++ pjsip/src/pjsip-ua/sip_inv.c	(working copy)
@@ -3275,7 +3275,7 @@
 
     pjsip_tsx_create_key(rdata->tp_info.pool, &key, PJSIP_ROLE_UAS,
 			 pjsip_get_invite_method(), rdata);
-    invite_tsx = pjsip_tsx_layer_find_tsx(&key, PJ_TRUE);
+    invite_tsx = pjsip_tsx_layer_find_tsx2(&key, PJ_TRUE);
 
     if (invite_tsx == NULL) {
 
@@ -3324,7 +3324,7 @@
     }
 
     if (invite_tsx)
-	pj_grp_lock_release(invite_tsx->grp_lock);
+	pj_grp_lock_dec_ref(invite_tsx->grp_lock);
 }
 
 
_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

[Index of Archives]     [Asterisk Users]     [Asterisk App Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [Linux API]
  Powered by Linux