Sage Weil wrote:
On Thu, 31 Mar 2011, Jim Schutt wrote:
I think instead I need to use TryLock() on the pipe_lock in submit_message(), in a loop with a suitable sleep, say 100us, and assert when it takes too long to acquire the lock. So, maybe add a Mutex::LockOrAbort(), and use it in submit_message()? submit_message() is intended to return immediately, no? And the issue is caused by heartbeat() being unable to queue messages, so this sounds to me to be a useful test. Does that seem to have low enough overhead to be useful?

Yeah, that sounds right!
I gave this a try with LockOrAbort using a 5 second timeout. When you ignore all the threads waiting on a condition variable, or in poll, this is what is left:

# egrep -v "__poll|pthread_cond_|__lll_lock_wait" thread-ids.txt
  395 Thread 18894  0x00007f60f804dd83 in do_writev (fd=11, vector=0x7f60f328fd00, count=5) at ../sysdeps/unix/sysv/linux/writev.c:46
  350 Thread 23189  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  69 Thread 20612  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  41 Thread 20474  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  17 Thread 20649  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
* 1 Thread 20155  0x00007f60f8eea9dd in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41

I attached some text files with gdb output for the stack trace of the aborting thread, a list of the thread ids, and all the thread traces.

But I haven't learned anything from this yet that helps figure out the cause of the delay. Can you think of anything I should try?

-- Jim
sage
Attachment:
thread-abort.txt.bz2
Description: application/bzip
Attachment:
thread-ids.txt.bz2
Description: application/bzip
Attachment:
thread-stacks.txt.bz2
Description: application/bzip