Re: cosd multi-second stalls cause "wrongly marked me down"

"Jim Schutt" <jaschut@xxxxxxxxxx> · Fri, 1 Apr 2011 16:38:32 -0600

Sage Weil wrote:
> On Thu, 31 Mar 2011, Jim Schutt wrote:
>
> > I think instead I need to use TryLock() on the pipe_lock
> > in submit_message(), in a loop with a suitable sleep,
> > say 100us, and assert when it takes too long to acquire
> > the lock.
> >
> > So, maybe add a Mutex::LockOrAbort(), and use it in
> > submit_message()?
> >
> > submit_message() is intended to return immediately, no?
> > And the issue is caused by heartbeat() being unable to
> > queue messages, so this sounds to me to be a useful
> > test.
> >
> > Does that seem to have low enough overhead to
> > be useful?
>
> Yeah, that sounds right!

I gave this a try with LockOrAbort using a 5 second
timeout.
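
For readers following along, here is a minimal sketch of the idea
(poll with a non-blocking try-lock, sleep about 100us between attempts,
abort once a deadline passes). It is not the actual patch: it uses
std::mutex in place of Ceph's Mutex class, and the names
try_lock_for_polling() and lock_or_abort() are hypothetical stand-ins
for the proposed Mutex::LockOrAbort().

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>

// Hypothetical helper: poll `m` with try_lock(), sleeping `poll` between
// attempts. Returns true once the lock is held, false if `timeout`
// elapses first.
inline bool try_lock_for_polling(
    std::mutex& m,
    std::chrono::microseconds timeout,
    std::chrono::microseconds poll = std::chrono::microseconds(100)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (m.try_lock())
            return true;                      // caller now owns the lock
        if (std::chrono::steady_clock::now() >= deadline)
            return false;                     // gave up waiting
        std::this_thread::sleep_for(poll);    // ~100us by default
    }
}

// Hypothetical stand-in for the proposed Mutex::LockOrAbort(): acquire the
// lock, or abort the process (leaving a core to inspect) when it cannot be
// taken within `timeout`. The 5 second default mirrors the test above.
inline void lock_or_abort(
    std::mutex& m,
    std::chrono::microseconds timeout = std::chrono::seconds(5)) {
    if (!try_lock_for_polling(m, timeout)) {
        std::fprintf(stderr, "lock_or_abort: lock not acquired in time\n");
        std::abort();
    }
}
```

The abort is the point of the exercise: it stops the process while the
lock holder is still stuck, so the thread stacks can be examined in gdb.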

When you ignore all the threads waiting on a condition
variable, blocked waiting for a lock, or in poll, this is
what is left:

# egrep -v "__poll|pthread_cond_|__lll_lock_wait" thread-ids.txt
  395 Thread 18894  0x00007f60f804dd83 in do_writev (fd=11, vector=0x7f60f328fd00, count=5) at ../sysdeps/unix/sysv/linux/writev.c:46
  350 Thread 23189  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  69 Thread 20612  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  41 Thread 20474  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
  17 Thread 20649  0x00007f60f8ee9f2b in sendmsg () from /lib64/libpthread.so.0
* 1 Thread 20155  0x00007f60f8eea9dd in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41

I attached some text files with gdb output for
the stack trace of the aborting thread, a list of the
thread ids, and all the thread traces.

So far, though, I haven't learned anything from this that
helps explain the cause of the delay.

Can you think of anything I should try?

-- Jim


Attachment: thread-abort.txt.bz2 (application/bzip)
Attachment: thread-ids.txt.bz2 (application/bzip)
Attachment: thread-stacks.txt.bz2 (application/bzip)