debug tips for cpu usage going wrong, deadlock issues, within pjsip or not. win32 specific thread profiling function used.

hamstiede@xxxxxxxx (M.S.) · Wed, 20 May 2009 07:26:05 +0000 (GMT)

Hello Gang,

Reading the wiki guideline is the first step toward resolving these problems, but on a multi-threaded, hyper-threaded or multi-processor system with many connections for one agent, race conditions become more likely.
I think I have read the whole pjsip mail archive, and every month I see reports of deadlocks, infinite loops and so on.
In my own applications I often use __FILE__ and __LINE__ debug output for mutex handling. It works fine!

regards 
   mark

________________________________
From: Gang Liu <gangban.lau at gmail.com>
To: pjsip list <pjsip at lists.pjsip.org>
Sent: Wednesday, 20 May 2009, 05:30:51
Subject: Re: [pjsip] debug tips for cpu usage going wrong, deadlock issues, within pjsip or not. win32 specific thread profiling function used.

http://trac.pjsip.org/repos/wiki/PJSUA_Locks

Is this guideline useful?
Indeed, it is still possible to construct a SIP message flow that causes a PJSUA or PJSIP-UA deadlock. But that is another topic.

regards,
Gang

On Mon, May 18, 2009 at 4:11 PM, M.S. <hamstiede at yahoo.de> wrote:

Thank you for your detailed debug summary. I am using Linux, but __FILE__ and __LINE__ work there too.
For the thread tagging, I will use the Linux PIDs (process IDs).
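
A note on tagging threads on Linux: getpid() returns the same value in every thread of a process, so the per-thread kernel ID is the more useful tag. A minimal sketch, assuming Linux with glibc; the helper name current_thread_tag() is made up for illustration and is not part of pjlib:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the kernel thread ID of the calling thread.
 * The raw syscall is used because older glibc versions have no gettid()
 * wrapper. In the main thread the value equals getpid(). */
static long current_thread_tag(void)
{
    long tid = (long)syscall(SYS_gettid);
    assert(tid > 0);
    return tid;
}
```

The returned value can be printed next to __FILE__ and __LINE__ in each mutex log line.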

I thought I had a deadlock because:
- With 1 or 2 connections (calls), CPU usage is below 5% (small ARM system).
- With 4 connections, one thread alone reaches 80-100% CPU usage after a few seconds.
  (I use a separate conference bridge for each media stream; each of those threads uses less than 2% CPU.)
- After this, I get deadlock messages from the stack such as:

possible deadlock ........

Questions:
Is it possible that this happens when a stack state machine checks its own mutex, e.g. with something like "bool IsPjsuaLocked()"?

I think I will build what you wrote into pjlib (thread tagging and detailed mutex logging with __FILE__ and __LINE__).

thank you

    Mark

________________________________

From: David Clark <vdc1048 at tx.rr.com>
To: pjsip list <pjsip at lists.pjsip.org>; pjsip list <pjsip at lists.pjsip.org>
Sent: Saturday, 16 May 2009, 09:08:42
Subject: [pjsip] debug tips for cpu usage going wrong, deadlock issues, within pjsip or not. win32 specific thread profiling function used.

No. 100% CPU might well be a bug, but it is most likely not a deadlock. A deadlock is where the OS has blocked you waiting for a mutex you will never get.
That blocked state uses 0% CPU. The usual symptom I see is that after some number of calls pjsip just stops communicating with anyone. Even SJPhone,
the SIP phone application, is ignored.

But how to track down your 100% CPU issue? A loop that never sleeps (no sleep(1) or similar) might be the cause. If you have a multithreaded
program on Windows that is using 100% of the CPU and you don't know why, I have a trick that tells you which thread is using
that CPU and rules out the others.

The solution is to create a linked list of every thread in the process, both yours and pjsip's. For your own threads, you can do this with a wrapper function around CreateThread().
For pjsip's threads, have pjsip's win32 thread creation function call a function that simply adds the new thread to your linked list.

OK, you have your linked list. Next step:
At regular intervals, say once a minute, call a function that runs the following on each thread in your linked list:
#include <windows.h>
#include <string.h>   /* memcpy */

/************************************************************************/
/*                                                                      */
/* Given a thread handle this function will determine what percent of   */
/* the process's user-mode and kernel-mode CPU time the thread has      */
/* used. Returns 1 on success and 0 on failure.                         */
/*                                                                      */
/************************************************************************/
int thread_percent(HANDLE thread, int *user_mode, int *kernel_mode)
{
    __int64 user_percent;
    __int64 kernel_percent;

    LARGE_INTEGER process_utime;
    LARGE_INTEGER process_ktime;
    LARGE_INTEGER thread_utime;
    LARGE_INTEGER thread_ktime;

    FILETIME process_creation_time;
    FILETIME process_exit_time;
    FILETIME process_user_time;
    FILETIME process_kernel_time;

    FILETIME thread_creation_time;
    FILETIME thread_exit_time;
    FILETIME thread_user_time;
    FILETIME thread_kernel_time;

    BOOL retval1, retval2;
    int retval=0; // assume failure.

    retval1=GetProcessTimes(GetCurrentProcess(),
                    &process_creation_time,
                    &process_exit_time,
                    &process_kernel_time,
                    &process_user_time);

    retval2=GetThreadTimes(thread,
                   &thread_creation_time,
                   &thread_exit_time,
                   &thread_kernel_time,
                   &thread_user_time);

    if ((retval1) && (retval2)) // if both functions worked.
    {
        memcpy(&process_utime, &process_user_time, sizeof(FILETIME));
        memcpy(&process_ktime, &process_kernel_time, sizeof(FILETIME));
        memcpy(&thread_utime, &thread_user_time, sizeof(FILETIME));
        memcpy(&thread_ktime, &thread_kernel_time, sizeof(FILETIME));

        if (process_utime.QuadPart==0)
            user_percent=0;
        else
            user_percent=thread_utime.QuadPart*100/process_utime.QuadPart;
        if (process_ktime.QuadPart==0)
            kernel_percent=0;
        else
            kernel_percent=thread_ktime.QuadPart*100/process_ktime.QuadPart;

        *user_mode=(int)user_percent;
        *kernel_mode=(int)kernel_percent;
        retval=1;
    }
    return(retval);
}
You will find that most threads report 0 and 0, and one is significantly higher, say 30 and 40. The one that is significantly higher is the offender.

Hope this helps and makes sense.  

For the deadlock debugging, this is what I did. I raised the pjsip debug level to 6, both at compile time and at runtime.
But that was still not enough to tell me what I needed to know. So I augmented the mutex lock/unlock/try/destroy functions in pjlib
with versions that record the caller's __FILE__ and __LINE__. __FILE__ gives the source file of the caller, and __LINE__ gives the
line number the function was called from.

I did something like this:
#define mutex_lock(lock) _mutex_lock(lock, __FILE__, __LINE__)

void _mutex_lock(mutex *lock, const char *file, int line)
{
   // in here I added file and line to the level 6 debug output,
   // then took the lock as before.
}

Run that and you know where you blocked and where the lock was taken before that, which gives the complete story.
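
A slightly fuller sketch of the same idea follows. It is an illustration only: it uses POSIX threads rather than pjlib's own mutex functions, logs to stderr rather than through the pjsip log, and every name in it is made up. A try-lock wrapper would follow the same pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* A mutex that remembers the call site where it was last acquired. */
struct traced_mutex {
    pthread_mutex_t mutex;
    const char     *owner_file;
    int             owner_line;
};

/* The macros capture the caller's file and line automatically. */
#define traced_lock(m)   traced_lock_at((m), __FILE__, __LINE__)
#define traced_unlock(m) traced_unlock_at((m), __FILE__, __LINE__)

int traced_init(struct traced_mutex *m)
{
    assert(m != NULL);
    m->owner_file = NULL;
    m->owner_line = 0;
    return pthread_mutex_init(&m->mutex, NULL);
}

int traced_lock_at(struct traced_mutex *m, const char *file, int line)
{
    int rc;

    assert(m != NULL);
    /* Logged before blocking: if a thread's last entry is "waiting",
     * this call site is where it is stuck. */
    fprintf(stderr, "mutex %p: waiting at %s:%d\n", (void *)m, file, line);
    rc = pthread_mutex_lock(&m->mutex);
    if (rc == 0) {
        /* Written while holding the lock, so owners never race here. */
        m->owner_file = file;
        m->owner_line = line;
        fprintf(stderr, "mutex %p: acquired at %s:%d\n", (void *)m, file, line);
    }
    return rc;
}

int traced_unlock_at(struct traced_mutex *m, const char *file, int line)
{
    assert(m != NULL);
    fprintf(stderr, "mutex %p: released at %s:%d\n", (void *)m, file, line);
    return pthread_mutex_unlock(&m->mutex);
}
```

In the log, the last "acquired" entry for a mutex with no matching "released" shows who holds it, and a trailing "waiting" entry shows who is blocked on it.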

David Clark
P.S. Note that this produces a ton of debug logs. We are talking gigabytes. In my own applications I have been able to
report mutex locking information as part of the thread usage reports. I list which thread is blocked on which mutex,
where the block happened and where ownership was last successfully obtained. The complete story on one line.
I have not put this into pjsip yet. It involved replacing all the mutex functions, and since my application never uses
a try-mutex function, I don't have that one. pjsip uses it heavily, so I would need to add that function.

At 02:54 AM 5/15/2009, M.S. wrote:

I think you have great skills in debugging deadlock situations in pjsip.
Can you give me some good advice on debugging pjsip/pjsua mutex problems (log level, debug output, etc.)?

I have a situation where pjsip reaches 100% processor use. I think I have a deadlock.

  regards
    mark

From: David Clark <vdc1048 at tx.rr.com>
To: pjsip list <pjsip at lists.pjsip.org>; pjsip list <pjsip at lists.pjsip.org>
Sent: Thursday, 14 May 2009, 22:22:56
Subject: [pjsip] pjsip version 1.1 gotcha #2 with work around.

This is similar to gotcha #1, so I will go into less detail. If the on_call_state() handler, which receives hangup
notifications, calls pjsua_recorder_destroy(), the box can stop communicating afterwards for the following reason:
1) pjsua_recorder_destroy() locks the pjsua mutex.
2) pjsua_recorder_destroy() locks the conf mutex indirectly by calling pjsua_conf_remove_port().

If the conf mutex is held by the port audio clock thread in get_frame() at the time of the call,
you will block waiting for the conf mutex, and other threads will never get the pjsua mutex.

Workaround: don't call pjsua_recorder_destroy() in on_call_state(). Instead, signal the application thread to wake up and do it
in the application thread.
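
One way to hand the work over is a small queue of recorder ids that the callback fills and the application thread drains. The following is only a sketch under assumptions: it uses POSIX threads, a plain int stands in for the recorder id, the application thread is assumed to poll the queue from its own loop (a condition variable or event could wake it instead), and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_PENDING 16

/* Recorder ids waiting to be destroyed by the application thread. */
static int             pending_ids[MAX_PENDING];
static int             pending_count = 0;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the on_call_state() handler: only records the request and
 * returns immediately. Returns 1 if queued, 0 if the queue is full. */
int defer_recorder_destroy(int recorder_id)
{
    int queued = 0;

    pthread_mutex_lock(&pending_lock);
    if (pending_count < MAX_PENDING) {
        pending_ids[pending_count++] = recorder_id;
        queued = 1;
    }
    pthread_mutex_unlock(&pending_lock);
    return queued;
}

/* Called from the application thread: moves up to 'max' queued ids into
 * 'out', oldest first, and returns how many were taken. The caller then
 * destroys each recorder with pending_lock already released. */
int take_pending_recorders(int *out, int max)
{
    int taken, i;

    assert(out != NULL && max >= 0);
    pthread_mutex_lock(&pending_lock);
    taken = pending_count < max ? pending_count : max;
    for (i = 0; i < taken; i++)
        out[i] = pending_ids[i];
    for (i = taken; i < pending_count; i++)
        pending_ids[i - taken] = pending_ids[i];
    pending_count -= taken;
    pthread_mutex_unlock(&pending_lock);
    return taken;
}
```

The application thread would then call pjsua_recorder_destroy() on each id it took, outside of any pjsua callback.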

David Clark

_______________________________________________
Visit our blog: http://blog.pjsip.org 

pjsip mailing list
pjsip at lists.pjsip.org
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org 