debug tips for cpu usage going wrong, deadlock issues, within pjsip or not. win32 specific thread profiling function used.


 



Also, attaching gdb to your program and dumping the thread stacks may be another
helpful method.
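For example, something like this (attach by pid, then dump every thread's
backtrace):

    gdb -p <pid>
    (gdb) thread apply all bt

or non-interactively: gdb -p <pid> -batch -ex "thread apply all bt"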

regards,
Gang

On Wed, May 20, 2009 at 3:47 PM, Gang Liu <gangban.lau at gmail.com> wrote:

> Everyone has their own preferred method of debugging and tracing problems. I
> won't comment on that.
>
> But how many people have read this document before using the pjsua API? Not
> many, I think.
>
> I will keep quiet if you don't want the advice.
>
> I tried a dual quad-core Xeon server with 1000 cps and 20000 channels. Is that
> enough?
>
> I know there may be differences between the pjsip-ua API and the pjsua API,
> but the key is that our own program logic must follow this guide, and that we
> spend enough time understanding why we need to follow it. Then things become
> much easier.
>
> regards,
> Gang
>   On Wed, May 20, 2009 at 3:26 PM, M.S. <hamstiede at yahoo.de> wrote:
>
>>  Hello Gang,
>>
>> reading the wiki guideline is the first step towards resolving the problems,
>> but if you have a multi-threading/hyper-threading/multi-processor system and
>> many connections per agent, the race conditions will increase.
>> I think I have read the whole pjsip mail archive, and every month I hear
>> about deadlock situations, infinite loops, etc.
>> In my own applications I often use __FILE__ and __LINE__ debug output for
>> the mutex code. It works fine!
>>
>>
>>
>> regards
>>    mark
>>
>>
>>  ------------------------------
>> *From:* Gang Liu <gangban.lau at gmail.com>
>> *To:* pjsip list <pjsip at lists.pjsip.org>
>> *Sent:* Wednesday, 20 May 2009, 05:30:51
>> *Subject:* Re: [pjsip] debug tips for cpu usage going wrong, deadlock
>> issues, within pjsip or not. win32 specific thread profiling function used.
>>
>> http://trac.pjsip.org/repos/wiki/PJSUA_Locks
>>
>> Is this guideline useful?
>> Indeed, it is still possible to construct a SIP message flow that causes
>> PJSUA or PJSIP-UA to deadlock, but that is another topic.
>>
>> regards,
>> Gang
>>
>> On Mon, May 18, 2009 at 4:11 PM, M.S. <hamstiede at yahoo.de> wrote:
>>
>>>  thank you for your detailed debug summary. I am using Linux, but
>>> __FILE__ and __LINE__ work there too.
>>> For the thread tagging, I will use the Linux PIDs (process IDs).
>>>
>>> I thought I had a deadlock because:
>>> - with 1 or 2 connections (calls) I have < 5% processor use (small ARM
>>> system).
>>> - with 4 connections I get 80-100% processor use, but only for one thread
>>> (after a few seconds).
>>> (for each media stream I use a separate conference bridge; all of those
>>> threads use only < 2% CPU)
>>> - after this I get some deadlock information from the stack, like:
>>>
>>> possible deadlock ........
>>>
>>> Questions:
>>> is it possible that this could happen if some stack state machine checks
>>> its own mutex with something like "bool IsPjsuaLocked()"?
>>>
>>> I think I will build the stuff you wrote into pjlib (thread tagging and
>>> detailed mutex logging with __FILE__ and __LINE__).
>>>
>>>
>>> thank you
>>>
>>>     Mark
>>>
>>>
>>>  ------------------------------
>>> *From:* David Clark <vdc1048 at tx.rr.com>
>>> *To:* pjsip list <pjsip at lists.pjsip.org>; pjsip list <
>>> pjsip at lists.pjsip.org>
>>> *Sent:* Saturday, 16 May 2009, 09:08:42
>>> *Subject:* [pjsip] debug tips for cpu usage going wrong, deadlock
>>> issues, within pjsip or not. win32 specific thread profiling function used.
>>>
>>> No, 100% CPU may well be a bug, but most likely it is not a deadlock.  A
>>> deadlock is where the OS blocks you waiting for a mutex you will never get,
>>> and that blocked state uses 0% CPU.  The usual symptom I see is that after
>>> some number of calls, pjsip just does not communicate with anyone.  Even
>>> SJPhone, the SIP phone application, is ignored.
>>>
>>> But how do you resolve your CPU-goes-to-100% issue?  Not putting sleep(1)
>>> in some of your loops might be the cause of it.  But if you have a
>>> multithreaded program on Windows that is using 100% of the CPU and you
>>> don't know why, I have some tricks to tell you where: which thread is
>>> using that CPU, ruling out the others.
>>>
>>> The solution is to create a linked list of every thread in the box, yours
>>> and pjsip's.  For your own threads you can do this with a wrapper function
>>> around CreateThread().  For pjsip's threads you can have pjsip's win32
>>> thread creation function call a function which simply adds the thread pjsip
>>> created to your linked list.  A rough sketch follows.
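>>>
>>> Something along these lines (just a sketch; the names here are only
>>> illustrative, not pjsip code):
>>>
>>> #include <windows.h>
>>> #include <stdlib.h>
>>>
>>> /* One node per thread we want to watch. */
>>> typedef struct thread_node {
>>>     HANDLE              handle;
>>>     DWORD               id;
>>>     struct thread_node *next;
>>> } thread_node;
>>>
>>> static thread_node      *g_thread_list = NULL;
>>> static CRITICAL_SECTION  g_thread_list_cs; /* InitializeCriticalSection() once at startup */
>>>
>>> /* Called from your CreateThread() wrapper, and from pjsip's win32 thread
>>>    creation code: remember the handle so it can be polled later. */
>>> static void register_thread(HANDLE h, DWORD id)
>>> {
>>>     thread_node *node = (thread_node*)malloc(sizeof(*node));
>>>     if (!node) return;
>>>     node->handle = h;
>>>     node->id     = id;
>>>     EnterCriticalSection(&g_thread_list_cs);
>>>     node->next    = g_thread_list;
>>>     g_thread_list = node;
>>>     LeaveCriticalSection(&g_thread_list_cs);
>>> }
>>>
>>> /* Use this instead of CreateThread() for your own threads. */
>>> static HANDLE create_thread_tracked(LPTHREAD_START_ROUTINE fn, LPVOID arg)
>>> {
>>>     DWORD  id;
>>>     HANDLE h = CreateThread(NULL, 0, fn, arg, 0, &id);
>>>     if (h != NULL)
>>>         register_thread(h, id);
>>>     return h;
>>> }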
>>>
>>> Ok, you've got your linked list.  Next step:
>>> at regular intervals, say once a minute, call a function which, for each
>>> thread in your linked list, calls this function:
>>>
>>> /************************************************************************/
>>> /* Given a thread handle this function will determine the percent of    */
>>> /* time the thread has spent in user mode and kernel mode.              */
>>> /************************************************************************/
>>> int thread_percent(HANDLE thread, int *user_mode, int *kernel_mode)
>>> {
>>>     __int64 user_percent;
>>>     __int64 kernel_percent;
>>>
>>>     LARGE_INTEGER process_utime;
>>>     LARGE_INTEGER process_ktime;
>>>     LARGE_INTEGER thread_utime;
>>>     LARGE_INTEGER thread_ktime;
>>>
>>>     FILETIME process_creation_time;
>>>     FILETIME process_exit_time;
>>>     FILETIME process_user_time;
>>>     FILETIME process_kernel_time;
>>>
>>>     FILETIME thread_creation_time;
>>>     FILETIME thread_exit_time;
>>>     FILETIME thread_user_time;
>>>     FILETIME thread_kernel_time;
>>>
>>>     BOOL retval1, retval2;
>>>     int retval=0; // assume failure.
>>>
>>>
>>>     retval1=GetProcessTimes(GetCurrentProcess(),
>>>                     &process_creation_time,
>>>                     &process_exit_time,
>>>                     &process_kernel_time,
>>>                     &process_user_time);
>>>
>>>     retval2=GetThreadTimes(thread,
>>>                    &thread_creation_time,
>>>                    &thread_exit_time,
>>>                    &thread_kernel_time,
>>>                    &thread_user_time);
>>>
>>>     if ((retval1) && (retval2)) // if both functions worked.
>>>     {
>>>         memcpy(&process_utime, &process_user_time, sizeof(FILETIME));
>>>         memcpy(&process_ktime, &process_kernel_time, sizeof(FILETIME));
>>>         memcpy(&thread_utime, &thread_user_time, sizeof(FILETIME));
>>>         memcpy(&thread_ktime, &thread_kernel_time, sizeof(FILETIME));
>>>
>>>         if (process_utime.QuadPart==0)
>>>             user_percent=0;
>>>         else
>>>             user_percent=thread_utime.QuadPart*100/process_utime.QuadPart;
>>>         if (process_ktime.QuadPart==0)
>>>             kernel_percent=0;
>>>         else
>>>             kernel_percent=thread_ktime.QuadPart*100/process_ktime.QuadPart;
>>>
>>>         *user_mode=(int)user_percent;
>>>         *kernel_mode=(int)kernel_percent;
>>>         retval=1;
>>>     }
>>>     return(retval);
>>> }
>>> You will find most threads will be at 0 and 0, and one will be significantly
>>> higher, say 30 and 40.  The one that is significantly higher is the offender.
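>>>
>>> Sketched out, the periodic poll would be something like this (using the
>>> illustrative list and node names from above):
>>>
>>> #include <stdio.h>
>>>
>>> /* Call this once a minute (e.g. from a timer or a housekeeping thread)
>>>    and see which thread's percentages stand out. */
>>> void report_thread_usage(void)
>>> {
>>>     thread_node *node;
>>>     int user_pct, kernel_pct;
>>>
>>>     EnterCriticalSection(&g_thread_list_cs);
>>>     for (node = g_thread_list; node != NULL; node = node->next) {
>>>         if (thread_percent(node->handle, &user_pct, &kernel_pct))
>>>             printf("thread %lu: user %d%%, kernel %d%%\n",
>>>                    (unsigned long)node->id, user_pct, kernel_pct);
>>>     }
>>>     LeaveCriticalSection(&g_thread_list_cs);
>>> }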
>>>
>>> Hope this helps and makes sense.
>>>
>>> On the deadlock debugging I did do, here is what I did.  I moved the pjsip
>>> debug level at compile time and runtime to 6.
>>> But that was still not enough to tell me what I needed to know.  So I
>>> augmented the mutex lock/unlock/try/destroy functions in pjlib
>>> with versions that record the caller's __FILE__ and __LINE__.  __FILE__
>>> gives the source module of the caller, and __LINE__ gives the
>>> line number the function was called from.
>>>
>>> I did something like this:
>>> #define mutex_lock(lock) _mutex_lock(lock, __FILE__, __LINE__)
>>>
>>> int _mutex_lock(mutex *lock, const char *file, int line)
>>> {
>>>     // in here I added file and line to the level 6 debug output,
>>>     // then called the real lock function and returned its status.
>>> }
>>>
>>> Run that and you know where you blocked, and where you locked prior to that,
>>> which gives the complete story.
>>>
>>> David Clark
>>> ps. Note this does produce a ton of debug logs.  We are talking gigabytes.
>>> In my own applications I have been able to give mutex locking information as
>>> part of the thread usage reports: I list which thread is blocked on which
>>> mutex, where the block happened, and where ownership was last successfully
>>> obtained.  The complete story on one line.  A rough sketch of that
>>> bookkeeping is below.
>>> I have not put this into pjsip yet.  It involved replacing all the mutex
>>> functions, and since my own application never uses a try-lock function I
>>> don't have that one, while pjsip uses it heavily.  So I would need to add
>>> that function.
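>>>
>>> Sketched very roughly (the type, field and helper names here are only
>>> illustrative, not pjsip code): each wrapped mutex remembers where it was
>>> last locked, and each caller records what it is waiting for before it
>>> blocks, so a usage report can print both on one line.
>>>
>>> typedef struct tracked_mutex {
>>>     pj_mutex_t *mutex;
>>>     const char *owner_file;    /* where the lock was last acquired */
>>>     int         owner_line;
>>>     DWORD       owner_thread;  /* GetCurrentThreadId() of the owner */
>>> } tracked_mutex;
>>>
>>> /* note_waiting() is an illustrative helper that records, per thread,
>>>    which mutex it is currently waiting for and from where. */
>>> pj_status_t tracked_mutex_lock(tracked_mutex *m, const char *file, int line)
>>> {
>>>     pj_status_t status;
>>>
>>>     note_waiting(GetCurrentThreadId(), m, file, line);
>>>     status = pj_mutex_lock(m->mutex);
>>>     if (status == PJ_SUCCESS) {
>>>         m->owner_file   = file;
>>>         m->owner_line   = line;
>>>         m->owner_thread = GetCurrentThreadId();
>>>     }
>>>     note_waiting(GetCurrentThreadId(), NULL, NULL, 0);  /* no longer waiting */
>>>     return status;
>>> }
>>> #define TRACKED_LOCK(m) tracked_mutex_lock((m), __FILE__, __LINE__)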
>>>
>>> At 02:54 AM 5/15/2009, M.S. wrote:
>>>
>>> I think you have great skill at debugging deadlock situations in pjsip.
>>> Can you give me some good advice on debugging the pjsip/pjsua mutex stuff
>>> (log level, debug output, etc.)?
>>>
>>> I have a situation where pjsip gets to 100% processor use. I think I have a
>>> deadlock.
>>>
>>>   regards
>>>     mark
>>>
>>>
>>> *From:* David Clark <vdc1048 at tx.rr.com>
>>> *To:* pjsip list <pjsip at lists.pjsip.org>; pjsip list <
>>> pjsip at lists.pjsip.org>
>>> *Sent:* Thursday, 14 May 2009, 22:22:56
>>> *Subject:* [pjsip] pjsip version 1.1 gotcha #2 with work around.
>>>
>>> This is similar to gotcha #1, so I will go into less detail.  If the
>>> on_call_state() handler, which gets hangup notifications, calls
>>> pjsua_recorder_destroy(), the box can stop communicating after that for the
>>> following reason:
>>> 1) pjsua_recorder_destroy() locks the pjsua mutex.
>>> 2) pjsua_recorder_destroy() locks the conf mutex indirectly by calling
>>> pjsua_conf_remove_port().
>>>
>>> But the conf mutex may already be held by the port audio clock thread in
>>> get_frame() at the point of the call.
>>>
>>> You will then block waiting for the conf mutex, and other threads will never
>>> get the pjsua mutex.
>>>
>>> Work around: don't call pjsua_recorder_destroy() in on_call_state(); instead
>>> signal the application thread to wake up and do it there.  A sketch of that
>>> is below.
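>>>
>>> One way that workaround could look (sketch only; the flag, event and
>>> my_recorder_id here are illustrative application code, not part of pjsua,
>>> and a real version needs proper synchronization around the flag):
>>>
>>> static pjsua_recorder_id g_pending_recorder = PJSUA_INVALID_ID;
>>> static HANDLE            g_app_wakeup_event;  /* CreateEvent() at startup */
>>>
>>> static void on_call_state(pjsua_call_id call_id, pjsip_event *e)
>>> {
>>>     pjsua_call_info ci;
>>>     pjsua_call_get_info(call_id, &ci);
>>>
>>>     if (ci.state == PJSIP_INV_STATE_DISCONNECTED) {
>>>         /* Do NOT destroy the recorder here; just remember it and
>>>            signal the application thread. */
>>>         g_pending_recorder = my_recorder_id;
>>>         SetEvent(g_app_wakeup_event);
>>>     }
>>> }
>>>
>>> /* In the application thread's main loop, after waiting on the event: */
>>> if (g_pending_recorder != PJSUA_INVALID_ID) {
>>>     pjsua_recorder_destroy(g_pending_recorder);
>>>     g_pending_recorder = PJSUA_INVALID_ID;
>>> }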
>>>
>>> David Clark
>>>

