Interesting deadlock bug found causing three threads to deadlock on the PJSUA, UA, and transaction locks.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi.

I found a bug that causes my application (using the PJSUA API) to deadlock when the following three events occur simultaneously:

1)   A call (received earlier by my application) is being "hung up" by calling pjsua_call_hangup(),
2The same call is also receiving a CANCEL message, and
3)   An INVITE is being received for a new separate call (as odd as that sounds).

I can reproduce this, and have multiple times, but it requires a heavy load (and luck) to create the proper event sequencing. I have included some "gdb" stack back traces below for the three threads. The bug is located in the processing of the CANCEL message, where the transaction group lock is attempted to be obtained after the user agent lock. That's the wrong locking order. Specifically, the CANCEL message is processed by mod_ua_on_rx_request() where the user agent lock is obtained and find_dlg_set_for_msg() is called to get the dialog set. The transaction group lock is attempted to be acquired by pjsip_tsx_layer_find_tsx()  which is called by find_dlg_set_for_msg() to find the INVITE transaction.

This is a possible sequence of events that cause the deadlock (see the stack back traces below):

1)  Thread 1 calls pjsua_call_hangup() to hang up a call.
2)  Thread 1 obtains the transaction lock pj_grp_lock_acquire(tsx->grp_lock).

3Thread 2 starts processing a new call (INVITE msg).
4Thread 2 obtains the PJSUA lock PJSUA_LOCK().

5Thread 3 starts processing a CANCEL msg for the call being "hung up" by Thread 1.
6Thread 3 obtains the UA lock pj_mutex_lock(mod_ua.mutex).
7Thread 3 attempts to obtain the transaction lock, but is blocked because Thread 1 is holding it.

8Thread 1 attempts to obtain the PJSUA lock, but is blocked because Thread 2 is holding it.

9Thread 2 attempts to obtain the UA lock, but is blocked because Thread 3 is holding it.

I modified the code to NOT to acquire the transaction group lock and deadlock no longer occurs (or at least not after 12 hours of heavy load testing). Based on the comments in the code, the group transaction lock is obtained in pjsip_tsx_layer_find_tsx() in order to "prevent the transaction from getting deleted before we have chance to lock it". Clearly the lock removal stops deadlock, but the transaction deletion concern mentioned in the code needs to be addressed.

A few other observations:

The function  mod_ua_on_rx_request() is checking for dialog deadlock, and has implemented a retry loop for mediation. After I took out the transaction group lock acquisition, I noticed that dialog deadlock was being detected and handled correctly (through the retry loop) for the calls that would otherwise caused the three thread deadlock described above.

I did not experience deadlock with my application until I modified it to handle SIP traffic from two UDP transports (from two separate network interfaces).  I don't know how that's related, except that probably made it possible for the new INVITE (Thread 2) and the CANCEL (Thread 3) messages to be processed simultaneously. I confirmed that in this case, both messages came in on separate transports.

I have more detailed 'gdb' output if you are interested.

Thanks,
Dave Richards

Thread 1: My application "hanging up" a call with code "606" (Not Acceptable)
          (this thread is holding the tsx->grp_lock, and blocked on the PJSUA lock */
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock (mutex=0x4878238) at ../src/pj/os_core_unix.c:1265
#4  PJSUA_LOCK () at ../include/pjsua-lib/pjsua_internal.h:575              /* blocked on PJSUA_LOCK() */
#5  pjsua_call_on_state_changed () at ../src/pjsua-lib/pjsua_call.c:3807
#6  inv_set_state () at ../src/pjsip-ua/sip_inv.c:317
#7  inv_on_state_incoming () at ../src/pjsip-ua/sip_inv.c:4310
#8  mod_inv_on_tsx_state () at ../src/pjsip-ua/sip_inv.c:717
#9  pjsip_dlg_on_tsx_state () at ../src/pjsip/sip_dialog.c:2064
#10 mod_ua_on_tsx_state () at ../src/pjsip/sip_ua_layer.c:178
#11 tsx_set_state () at ../src/pjsip/sip_transaction.c:1235
#12 tsx_on_state_proceeding_uas () at ../src/pjsip/sip_transaction.c:2819
#13 pjsip_tsx_send_msg () at ../src/pjsip/sip_transaction.c:1751            /* obtained pj_grp_lock_acquire(tsx->grp_lock) */
#14 pjsip_dlg_send_response () at ../src/pjsip/sip_dialog.c:1529
#15 pjsip_inv_send_msg () at ../src/pjsip-ua/sip_inv.c:3227
#16 pjsua_call_hangup (code=606) at ../src/pjsua-lib/pjsua_call.c:2407
#17 timer_callback () at TERMmain.c:313
#18 pj_timer_heap_poll () at ../src/pj/timer.c:643
#19 TIMERpoll ()
#20 main_func ()
#21 pj_run_app ()
#22 main ()

Thread 2: Worker thread handling new call (incoming INVITE)
          (this thread is holding the PJSUA lock, blocked on the mod_ua.mutex lock)
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4  pjsip_ua_register_dlg () at ../src/pjsip/sip_ua_layer.c:298            /* blocked on pj_mutex_lock(mod_ua.mutex) */
#5  create_uas_dialog () at ../src/pjsip/sip_dialog.c:533
#6  pjsip_dlg_create_uas_and_inc_lock () at ../src/pjsip/sip_dialog.c:590
#7  pjsua_call_on_incoming () at ../src/pjsua-lib/pjsua_call.c:1386        /* obtained PJSUA_LOCK() */
#8  mod_pjsua_on_rx_request () at ../src/pjsua-lib/pjsua_core.c:577
#9  pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#10 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#11 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#12 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#13 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#14 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#15 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#16 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#17 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#18 thread_main () at ../src/pj/os_core_unix.c:541
#19 start_thread () from /lib64/libpthread.so.0
#20 clone () from /lib64/libc.so.6

Thread 3: OFFENDING Worker thread handling CANCEL message for call being "hung up" in Thread 1.
          (this thread is holding the mod_ua.mutex lock, blocked on the tsx->grp_lock lock)
#0  __lll_lock_wait () from /lib64/libpthread.so.0
#1  _L_lock_839 () from /lib64/libpthread.so.0
#2  pthread_mutex_lock () from /lib64/libpthread.so.0
#3  pj_mutex_lock () at ../src/pj/os_core_unix.c:1265
#4  pj_lock_acquire () at ../src/pj/lock.c:180
#5  grp_lock_acquire () at ../src/pj/lock.c:290
#6  pj_grp_lock_acquire () at ../src/pj/lock.c:478
#7  pjsip_tsx_layer_find_tsx () at ../src/pjsip/sip_transaction.c:672    /* blocked on pj_grp_lock_acquire(tsx->grp_lock) */
#8  find_dlg_set_for_msg () at ../src/pjsip/sip_ua_layer.c:554
#9  mod_ua_on_rx_request () at ../src/pjsip/sip_ua_layer.c:615           /* obtains pj_mutex_lock(mod_ua.mutex) */
#10 pjsip_endpt_process_rx_data () at ../src/pjsip/sip_endpoint.c:887
#11 endpt_on_rx_msg () at ../src/pjsip/sip_endpoint.c:1037
#12 pjsip_tpmgr_receive_packet () at ../src/pjsip/sip_transport.c:1939
#13 udp_on_read_complete () at ../src/pjsip/sip_transport_udp.c:170
#14 ioqueue_dispatch_read_event () at ../src/pj/ioqueue_common_abs.c:605
#15 pj_ioqueue_poll () at ../src/pj/ioqueue_epoll.c:796
#16 pjsip_endpt_handle_events2 () at ../src/pjsip/sip_endpoint.c:742
#17 pjsua_handle_events () at ../src/pjsua-lib/pjsua_core.c:1981
#18 worker_thread () at ../src/pjsua-lib/pjsua_core.c:699
#19 thread_main () at ../src/pj/os_core_unix.c:541
#20 start_thread () from /lib64/libpthread.so.0
#21 clone () from /lib64/libc.so.6

_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

[Index of Archives]     [Asterisk Users]     [Asterisk App Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [Linux API]
  Powered by Linux