Re: Crash in pj_timer_heap_poll

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Hugo,

Just got a chance to continue investigation on this issue. Turned out that the problem is rather complex, for example, cached tx_data may be destroyed by on_cache_timeout() or by STUN session destroy or even by both (race condition), so synchronization between STUN session and cached tx_data is needed. Attached is an updated patch.

BR,
nanang


On Thu, Jul 11, 2019 at 12:35 AM Hugo Sobral <hugo.sobral@xxxxxxxxx> wrote:
Hi Nanang,

I have the same result but in a different location

Exception thrown: read access violation.
   atomic_var was 0xFFFFFFFFFFFFFFFF. occurred
pj_wrapper.dll!pj_atomic_dec_and_get(pj_atomic_t * atomic_var) Line 800
   at pjlib\src\pj\os_core_win32.c(800)
pj_wrapper.dll!grp_lock_dec_ref(pj_grp_lock_t * glock) Line 554
   at pjlib\src\pj\lock.c(554)
pj_wrapper.dll!pj_grp_lock_dec_ref(pj_grp_lock_t * glock) Line 647
   at pjlib\src\pj\lock.c(647)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 654
   at pjlib\src\pj\timer.c(654)


Some vars from debugger:

&atomic_var 0x0000002f293fe670 {0xdddddddddddddddd {value=??? }} pj_atomic_t * *
   0xdddddddddddddddd {value=??? } pj_atomic_t *
      value <Unable to read memory>

&atomic_var->value 0xdddddddddddddddd {???} long *
   <Unable to read memory> long






On Tue, 9 Jul 2019 at 12:16, Nanang Izzuddin <nanang@xxxxxxxxx> wrote:
Hi Hugo,

Instead of 'pj_stun_session', it is perhaps the 'pj_stun_tx_data' that had been destroyed when 'on_cache_timeout' was invoked, and this scenario seems to be possible indeed. So just created a patch (attached), unfortunately I could not easily reproduce the issue here, so if the issue is reproducible there, please help us testing the patch, thank you in advance.

BR,
nanang


On Fri, Jul 5, 2019 at 7:10 PM Hugo Sobral <hugo.sobral@xxxxxxxxx> wrote:
Hi Nanang,

It was in this line:

    PJ_LOG(5,(SNAME(tdata->sess), "Response cache deleted"));



Exception thrown: read access violation.
   tdata->sess was 0xFFFFFFFFFFFFFFF7. occurred
pj_wrapper.dll!on_cache_timeout(pj_timer_heap_t * timer_heap, pj_timer_entry * entry) Line 226
   at pjnath\src\pjnath\stun_session.c(226)
pj_wrapper.dll!pj_timer_heap_poll(pj_timer_heap_t * ht, pj_time_val * next_delay) Line 651
   at pjlib\src\pj\timer.c(651)
[Managed to Native Transition]


From debugger, current values:

   tdata   0x000001c5f264f828 {prev=0xdddddddddddddddd {prev=??? next=??? pool=??? ...} next=0xdddddddddddddddd {...} ...}   pj_stun_tx_data *
      prev   0xdddddddddddddddd {prev=??? next=??? pool=??? ...}   pj_stun_tx_data *
      next   0xdddddddddddddddd {prev=??? next=??? pool=??? ...}   pj_stun_tx_data *
      pool   0xdddddddddddddddd {prev=??? next=??? obj_name=0xdddddddddddddded <Error reading characters of string.> ...}   pj_pool_t *
      sess   0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}   pj_stun_session *
      msg   0xdddddddddddddddd {hdr={type=??? length=??? magic=??? ...} attr_count=??? attr=0xddddddddddddddf5 {...} }   pj_stun_msg *
      token   0xdddddddddddddddd   void *
      client_tsx   0xdddddddddddddddd {obj_name=0xdddddddddddddddd <Error reading characters of string.> cb={_on_complete_=...} ...}   pj_stun_client_tsx *
      retransmit   -572662307   int
      msg_magic   3722304989   unsigned int
      msg_key   0x000001c5f264f868 "ÝÝÝÝÝÝÝÝÝÝÝÝ...   unsigned char[12]
      auth_info   {realm={ptr=0xdddddddddddddddd <Error reading characters of string.> slen=-2459565876494606883 } username=...}   pj_stun_req_cred_info
      pkt   0xdddddddddddddddd   void *
      max_len   3722304989   unsigned int
      pkt_size   15987178197214944733   unsigned __int64
      addr_len   3722304989   unsigned int
      dst_addr   0xdddddddddddddddd   const void *
      res_timer   {user_data=0xdddddddddddddddd id=-572662307 cb=0xdddddddddddddddd ...}   pj_timer_entry
     
     
   tdata->sess   0xdddddddddddddddd {cfg=??? pool=??? grp_lock=??? ...}   pj_stun_session *
      cfg   <Unable to read memory>  
      pool   <Unable to read memory>  
      grp_lock   <Unable to read memory>  
      cb   {_on_send_msg_=??? _on_rx_request_=??? _on_request_complete_=??? ...}   pj_stun_session_cb
      user_data   <Unable to read memory>  
      is_destroying   <Unable to read memory>  
      use_fingerprint   <Unable to read memory>  
      rx_pool   <Unable to read memory>  
      dump_buf   0xddddddddddddde2d <Error reading characters of string.>   char[1000]
      log_flag   <Unable to read memory>  
      auth_type   <Unable to read memory>  
      cred   {type=??? data="" slen=??? } username={ptr=??? slen=??? } data_type=??? ...} ...} }   pj_stun_auth_cred
      auth_retry   <Unable to read memory>  
      next_nonce   {ptr=??? slen=??? }   pj_str_t
      server_realm   {ptr=??? slen=??? }   pj_str_t
      srv_name   {ptr=??? slen=??? }   pj_str_t
      pending_request_list   {prev=??? next=??? pool=??? ...}   pj_stun_tx_data
      cached_response_list   {prev=??? next=??? pool=??? ...}   pj_stun_tx_data

Best Regards,
Hugo Sobral

On Wed, 3 Jul 2019 at 11:04, Nanang Izzuddin <nanang@xxxxxxxxx> wrote:
Hi Hugo,

The STUN tdata timer in STUN session (i.e: tdata->res_timer) is scheduled using 'pj_stun_session' group lock, so normally it should be safe from race condition between 'pj_stun_session' destroy and 'on_cache_timeout', but perhaps there is a corner case, could you also send the call stack traces when the crash occurs?

BR,
nanang


On Sat, Jun 29, 2019 at 1:30 AM Hugo Sobral <hugo.sobral@xxxxxxxxx> wrote:
Hi,

Sometimes I get a crash in "pj_timer_heap_poll"
I notice that there was some corrections about that in milestone 2.9


But these corrections does not solve my issue.

After some time debugging the issue, it looks like that it happens when the code:

-------------------------------
if (node->cb)
(*node->cb)(ht, node);
------------------------------- 
in function  'pj_timer_heap_poll' is invoked with the function 'on_cache_timeout' at the same time that 'pj_stun_session' is being destroyed (this destroy is made by application's code because the call terminates).

I cannot understand exactly how this happens,
but I think that there is a time gap in function 'pj_timer_heap_poll' after the 'unlock_timer_heap(ht)' that 'pj_stun_session' can be destroyed and crash inside 'on_cache_timeout'


Is this a known issue in the specified functions?

Best regards,
Hugo Sobral






_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

Attachment: stun_tdata_double_destroy.diff
Description: Binary data

_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

[Index of Archives]     [Asterisk Users]     [Asterisk App Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [Linux API]
  Powered by Linux