PJNATH 2.2.1 bug: deadlock in stun_sock.c

GHilliard@xxxxxxxxxx (George Hilliard) · Tue, 5 Aug 2014 21:33:41 +0000

I posted a while back [1] about a deadlock occurring in PJNATH when receiving callbacks.  The issue is twofold (matching what I was seeing, "lock level = 2"), as follows.

First, the initial configuration of the active socket at stun_sock.c:320 hardcodes concurrency to false.  The whole_data flag is true; together, these settings prevent the grp_lock from being unlocked in the proper order.  This change was made in ticket #460.

Next, stun_sock.c:957 does not unlock the grp_lock before invoking the callback, so the grp_lock is held for the duration of the callback.  Our custom callback then locks our application mutex.  Meanwhile, another thread in our application already holds the application mutex and makes a call that attempts to lock the grp_lock.  The two threads take the same two locks in opposite order, so each ends up waiting on the other.
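
To make the lock ordering concrete, here is a minimal sketch using plain pthread mutexes instead of PJLIB.  All of the names in it (grp_lock, app_mutex, on_rx_data, deliver_locked, deliver_unlocked) are stand-ins I made up for illustration, not PJNATH code, and the real grp_lock is a pj_grp_lock_t rather than an ordinary mutex.  It only shows the difference between invoking a callback with the library lock held and invoking it after the lock is released.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins, not PJNATH API. */
static pthread_mutex_t grp_lock  = PTHREAD_MUTEX_INITIALIZER; /* library-side lock */
static pthread_mutex_t app_mutex = PTHREAD_MUTEX_INITIALIZER; /* application lock  */

typedef void (*rx_callback)(void);

/* Set by the callback: 1 if grp_lock was still held when it ran. */
static int lock_held_in_cb;

/* Application callback: takes app_mutex, as our real callback does. */
static void on_rx_data(void)
{
    pthread_mutex_lock(&app_mutex);

    /* Probe the library lock without blocking.  trylock returns EBUSY
     * if the mutex is locked by any thread, including this one. */
    int rc = pthread_mutex_trylock(&grp_lock);
    assert(rc == 0 || rc == EBUSY);
    if (rc == 0) {
        lock_held_in_cb = 0;
        pthread_mutex_unlock(&grp_lock);
    } else {
        lock_held_in_cb = 1;
    }

    pthread_mutex_unlock(&app_mutex);
}

/* Deadlock-prone ordering: the callback runs with grp_lock held, so this
 * thread acquires grp_lock then app_mutex.  A second thread that holds
 * app_mutex and then wants grp_lock can block forever against it. */
static void deliver_locked(rx_callback cb)
{
    pthread_mutex_lock(&grp_lock);
    /* ... packet processing would happen here ... */
    cb();
    pthread_mutex_unlock(&grp_lock);
}

/* Ordering with the lock released first: the callback never runs while
 * grp_lock is held, so the two locks are never nested on this path. */
static void deliver_unlocked(rx_callback cb)
{
    pthread_mutex_lock(&grp_lock);
    /* ... packet processing would happen here ... */
    pthread_mutex_unlock(&grp_lock);
    cb();
}
```

Calling deliver_locked(on_rx_data) leaves lock_held_in_cb set to 1, and deliver_unlocked(on_rx_data) leaves it at 0.  The sketch is single-threaded on purpose so that it never actually hangs; the deadlock itself needs the second thread described above.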

A patch [2] fixes the deadlock we're seeing, but I'm concerned that its first change, a hack to .whole_data and .concurrency, might introduce other bugs.  I'm not familiar enough with this code to make the call; could a developer please explain the best way to correct this?  The second change seems harmless, because nothing is done after the lock is released except calling the callback.  Please correct me if I'm wrong.

George Hilliard

[1]: http://lists.pjsip.org/pipermail/pjsip_lists.pjsip.org/2014-June/017675.html
[2]: https://gist.github.com/thirtythreeforty/d37d98d324a17d1121a2