Re: cosd multi-second stalls cause "wrongly marked me down"

On Wed, 2011-03-02 at 19:26 -0700, Colin McCabe wrote:
> Hi Jim,
> 
> We have seen this problem before. The usual suspect is the OOM
> killer (grep for "out of memory" in syslog).
> Unfortunately, SIGKILL is uncatchable and that's what the OOM killer sends.
> 
> Another problem that can prevent core files from being generated is
> bad ulimit -c settings or a bad setting for core_pattern and friends.
> One problem I have a lot too is that the partition I'm writing core
> files to fills up.
> 
> If none of that works, it's possible that someone is calling exit()
> somewhere. You can attach a gdb to the process and put a breakpoint on
> exit() to see if this is going on. There's a lot of "your foo is not
> bar enough, I hate your config, exit(1)" type code that gets executed
> while the daemon is starting up. It sounds like you should be past
> that point, though.
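
As a rough illustration of the core-file checks mentioned above, the following sketch reads the process's own core size limit (the value that "ulimit -c" reports) and the kernel's core_pattern. The helper names core_dumps_enabled() and read_core_pattern() are made up for this example and are not part of Ceph; the /proc path assumes Linux.

```cpp
#include <sys/resource.h>
#include <fstream>
#include <string>

// Hypothetical helper (not Ceph code): true if the soft RLIMIT_CORE
// allows a core file larger than zero bytes to be written.
bool core_dumps_enabled()
{
    struct rlimit rl;
    if (::getrlimit(RLIMIT_CORE, &rl) != 0)
        return false;           // could not query the limit; assume disabled
    return rl.rlim_cur != 0;    // a soft limit of 0 suppresses core files
}

// Hypothetical helper: returns the first line of the kernel's core_pattern
// (Linux only), or an empty string if the file cannot be read.
std::string read_core_pattern()
{
    std::ifstream in("/proc/sys/kernel/core_pattern");
    std::string pattern;
    std::getline(in, pattern);
    return pattern;
}
```

A daemon could log both values at startup, so that a missing core file can later be traced to a zero limit or to a pattern pointing at a full partition.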

I've finally gotten a little info, using a variant of
your gdb idea: I waited until many of the OSD instances
had died, then I attached gdb to several that were left,
and waited.

Two of them died the same way, like this:

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fd7888c8940 (LWP 28693)]
0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7fd799b67c20, sd=13, msg=0x7fd7888c7f20, len=251237, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7fd799b67c20, m=0x7fd79b2dcb70) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7fd799b67c20)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7fd799b67e70) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7fd799b67e70)
    at ./common/Thread.h:41
#6  0x00007fd7a9b7b73d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007fd7a8a91f6d in clone () from /lib64/libc.so.6
(gdb) 


Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f1aed7f3940 (LWP 28726)]
0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7f1af15c94d0, sd=114, msg=0x7f1aed7f2f20, len=126728, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7f1af15c94d0, m=0x23d3010) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7f1af15c94d0)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7f1af15c9720) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7f1af15c9720)
    at ./common/Thread.h:41
#6  0x00007f1b0123173d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007f1b00147f6d in clone () from /lib64/libc.so.6

The third also got 

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f531fefe940 (LWP 28700)]
0x00007f533ffeaf2b in sendmsg () from /lib64/libpthread.so.0
(gdb) 

but something was a little different and I didn't get a 
backtrace from it.
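
For reference, here is a minimal sketch of the mechanism those traces point at: by default, sendmsg() on a stream socket whose peer has gone away raises SIGPIPE, and the default action for SIGPIPE terminates the process. Passing MSG_NOSIGNAL (Linux) makes the call fail with EPIPE instead. The helpers send_nosigpipe() and errno_after_peer_close() are made up for this example; this is not what SimpleMessenger does, and I'm only assuming something along these lines would apply there.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper (not Ceph code): send one buffer with sendmsg(),
// asking the kernel not to raise SIGPIPE if the peer has gone away.
// Returns the number of bytes sent, or -1 with errno set.
ssize_t send_nosigpipe(int sd, const void *buf, size_t len)
{
    struct iovec iov;
    iov.iov_base = const_cast<void *>(buf);
    iov.iov_len = len;

    struct msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return ::sendmsg(sd, &msg, MSG_NOSIGNAL);
}

// Demonstration: create a connected socket pair, close one end, then
// send on the other. Returns the errno from the failed send, 0 if the
// send unexpectedly succeeded, or -1 if the socket pair could not be made.
int errno_after_peer_close()
{
    int sv[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    ::close(sv[1]);                         // the peer goes away
    const char data[] = "hello";
    ssize_t r = send_nosigpipe(sv[0], data, sizeof(data));
    int err = (r < 0) ? errno : 0;
    ::close(sv[0]);
    return err;
}
```

Without MSG_NOSIGNAL the same send would deliver SIGPIPE to the process. The other common approach is to ignore the signal process-wide with signal(SIGPIPE, SIG_IGN) early in startup, after which the send likewise fails with EPIPE.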

-- Jim


> 
> Colin
> 

