debug tips for cpu usage going wrong, deadlock issues, within pjsip or not. win32 specific thread profiling function used.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



At 03:11 AM 5/18/2009, M.S. wrote:
>thank you for your detailed debug summary. I am using linux but 
>__FILE__ and __LINE__ is working too.
>For the thread tagging, i will use the linux pids (process-id).
>
>I thought i have a deadlock because:
>- if i have 1 or 2 connections(calls) i have < 5 % processor use 
>(small arm system).
>-if i have 4 connections if get (only for one thread) 80-100% 
>processor use. (after a view seconds).
>(for each media stream i use a separate conference bridge. all this 
>threads will used only (<2% cpu))
>- after this i get some deadlock informations form the stack like:
>
>possible deadlock ........
>
>Questions:
>it is possible that this could happened  if  any stack state machine 
>checks his own  mutex like "bool IsPjsuaLocked()" ?

Don't know.  I will point to this out though.  I can get to 93 
callers on one conference bridge.  Never tried the multiple
conference bridges.  You might be running into something with that I 
have not seen.  Run more than 93 callers and cpu does
not spike but audio playback drags out taking like 10 times longer 
than it should.


>I think i will build this stuff your wrote in the pjlib (thread 
>tagging and detaild mutex logging with __FILE__ and __LINE__)
>
>
>thank you
>
>     Mark
>
>
>
>Von: David Clark <vdc1048 at tx.rr.com>
>An: pjsip list <pjsip at lists.pjsip.org>; pjsip list <pjsip at lists.pjsip.org>
>Gesendet: Samstag, den 16. Mai 2009, 09:08:42 Uhr
>Betreff: [pjsip] debug tips for cpu usage going wrong, deadlock 
>issues, within pjsip or not. win32 specific thread profiling function used.
>
>No 100% might well be a bug but most likely not deadlock.  deadlock 
>is where you are locked by the OS waiting for a mutex you will never get.
>This locked state uses 0% cpu.  The usual symptom I see is after x 
>number of calls.   pjsip just does not communicate with anyone.  Even SJPhone
>the sip phone application is ignored.
>
>But how to resolve your cpu goes to 100% issue.  Yea not putting 
>sleep(1) in some of your loops might be the cause of it.  But if you 
>have a multithread
>program in windows that is using 100% of the cpu and you don't know 
>why.  I got some tricks to tell you where.  Tell you which thread is using
>that cpu and rule out others.
>
>The solution is to create a linked list of every thread in the box 
>yorus and pjsip threads.  You can do this with a wrapper function 
>around CreateThread()
>for your stuff.  You can do this by having pjsip's win32 create 
>thread function call a function which simply adds the thread pjsip 
>created to your linked list.
>
>Ok got your linked list.  Next step.
>At regular intervals like say once a minute call a function which 
>for each thread in your linked list calls this function:
>/************************************************************************/
>/*                                                                      */
>/* Given a thread handle this function will determine the percent of    */
>/* time the thread has spent in user mode and kernal mode.              */
>/*                                                                      */
>/************************************************************************/
>int thread_percent(HANDLE thread, int *user_mode, int *kernel_mode)
>{
>     __int64 user_percent;
>     __int64 kernel_percent;
>
>     LARGE_INTEGER process_utime;
>     LARGE_INTEGER process_ktime;
>     LARGE_INTEGER thread_utime;
>     LARGE_INTEGER thread_ktime;
>
>     FILETIME process_creation_time;
>     FILETIME process_exit_time;
>     FILETIME process_user_time;
>     FILETIME process_kernel_time;
>
>     FILETIME thread_creation_time;
>     FILETIME thread_exit_time;
>     FILETIME thread_user_time;
>     FILETIME thread_kernel_time;
>
>     BOOL retval1, retval2;
>     int retval=0; // assume failure.
>
>
>     retval1=GetProcessTimes(GetCurrentProcess(),
>                     &process_creation_time,
>                     &process_exit_time,
>                     &process_kernel_time,
>                     &process_user_time);
>
>     retval2=GetThreadTimes(thread,
>                    &thread_creation_time,
>                    &thread_exit_time,
>                    &thread_kernel_time,
>                    &thread_user_time);
>
>     if ((retval1) && (retval2)) // if both functions worked.
>     {
>         memcpy(&process_utime, &process_user_time, sizeof(FILETIME));
>         memcpy(&process_ktime, &process_kernel_time, sizeof(FILETIME));
>         memcpy(&thread_utime, &thread_user_time, sizeof(FILETIME));
>         memcpy(&thread_ktime, &thread_kernel_time, sizeof(FILETIME));
>
>                   if (process_utime.QuadPart==0)
>                            user_percent=0;
>                   else
> 
>user_percent=thread_utime.QuadPart*100/process_utime.QuadPart;
>         if (process_ktime.QuadPart==0)
>                  kernel_percent=0;
>         else
> 
>kernel_percent=thread_ktime.QuadPart*100/process_ktime.QuadPart;
>
>         *user_mode=(int)user_percent;
>         *kernel_mode=(int)kernel_percent;
>         retval=1;
>     }
>     return(retval);
>}
>You will find most threads will be 0 and 0 and one will be 
>significantly higher say 30 and 40.  The one that is significantly 
>higher is the offender.
>
>Hope this helps and makes sense.
>
>On the deadlock debugging I did do.  I did this.  I moved pjsip 
>debug level at compile time and runtime to 6.
>But that was still not enough to tell me what I needed to know.  So 
>I augumented the mutex lock/unlock/try/destroy functions in pjilib
>with a version that gave the callers __FILE__, and 
>__LINE__.  __FILE__ gives the source module of the caller, and 
>__LINE__ gives the
>line number the function was called from.
>
>I did something like this:
>#define mutex_lock(lock) _mutex_lock(lock, __FILE__, __LINE__)
>
>_mutex_lock(multex *lock, char *file, int line)
>{
>    // in here I added file, and line to the level 6 debug output.
>}
>
>Run that and you know where you blocked and where you locked prior 
>to that which gives the complete story.
>
>David Clark
>ps. Note this does produce a ton of debug logs.  We are talking 
>gigabytes.   In my applications I have been able to
>give mutex locking information as part of the thread usage 
>reports.  I list what thread is blocked on which mutex
>where the block happened and where ownership was last successfully 
>obtained.  The complete story on one line.
>I have not put this into pjsip yet.  It involved replacing all mutex 
>functions, and since my application never uses
>a try mutex function, I don't have that one.  And pjsip uses that 
>heavily.  So I would need to add that function.
>
>At 02:54 AM 5/15/2009, M.S. wrote:
>>i think you have great skills to debug  deadlock situations in pjsip.
>>can you give me a  good advice  to debug pjsip/pjsua mutex stuff 
>>(log-level, debug outputs etc.)
>>
>>i have a situation that pjsip got 100% processor use. i think i 
>>have a deadlock.
>>
>>   regards
>>     mark
>>
>>
>>Von: David Clark <vdc1048 at tx.rr.com>
>>An: pjsip list <pjsip at lists.pjsip.org>; pjsip list <pjsip at lists.pjsip.org>
>>Gesendet: Donnerstag, den 14. Mai 2009, 22:22:56 Uhr
>>Betreff: [pjsip] pjsip version 1.1 gotcha #2 with work around.
>>
>>This is similar to gotcha #1.  So I will go into less detail.  If 
>>the on_call_state() handler which gets hangup
>>notifications calls pjsua_recorder_destroy().  The box can stop 
>>communicating after that for the following reason.
>>1) pjsua_recorder_destroy() locks pjsua mutex.
>>2) pjsua_recorder_destroy() locals conf mutex indirectly by calling 
>>pjsua_conf_remove_port().
>>
>>But if conf mutex is locked in port audio clock thread get_frame() 
>>at the point of the call.
>>
>>You will block waiting for conf mutex and other threads will never 
>>get pjsua mutex.
>>
>>Work around: don't call pjsua_recorder_destroy() in on_call_state() 
>>signal the application thread to wait up and do it
>>in the application thread.
>>
>>David Clark
>>
>>_______________________________________________
>>Visit our blog: <http://blog.pjsip.org>http://blog.pjsip.org
>>
>>pjsip mailing list
>>pjsip at lists.pjsip.org
>><http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org>http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org 
>>
>
>_______________________________________________
>Visit our blog: http://blog.pjsip.org
>
>pjsip mailing list
>pjsip at lists.pjsip.org
>http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.pjsip.org/pipermail/pjsip_lists.pjsip.org/attachments/20090520/8399d1f8/attachment.html>


[Index of Archives]     [Asterisk Users]     [Asterisk App Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [Linux API]
  Powered by Linux