On Wed, 2011-03-02 at 19:26 -0700, Colin McCabe wrote:
> Hi Jim,
>
> We have seen this problem before. The usual suspects are the oom
> killer (grep for "out of memory" in syslog).
> Unfortunately, SIGKILL is uncatchable and that's what the OOM killer sends.
>
> Another problem that can prevent core files from being generated is
> bad ulimit -c settings or a bad setting for core_pattern and friends.
> One problem I have a lot too is that the partition I'm writing core
> files to fills up.
>
> If none of that works, it's possible that someone is calling exit()
> somewhere. You can attach a gdb to the process and put a breakpoint on
> exit() to see if this is going on. There's a lot of "your foo is not
> bar enough, I hate your config, exit(1)" type code that gets executed
> while the daemon is starting up. It sounds like you should be past
> that point, though.

I've finally gotten a little info, using a variant of your gdb idea:

I waited until many of the OSD instances had died, then I attached
gdb to several that were left, and waited.

Two of them died the same way, like this:

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fd7888c8940 (LWP 28693)]
0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fd7a9b82f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7fd799b67c20, sd=13, msg=0x7fd7888c7f20, len=251237, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7fd799b67c20, m=0x7fd79b2dcb70) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7fd799b67c20)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7fd799b67e70) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7fd799b67e70)
    at ./common/Thread.h:41
#6  0x00007fd7a9b7b73d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007fd7a8a91f6d in clone () from /lib64/libc.so.6
(gdb)

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f1aed7f3940 (LWP 28726)]
0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f1b01238f2b in sendmsg () from /lib64/libpthread.so.0
#1  0x0000000000672e0b in SimpleMessenger::Pipe::do_sendmsg (
    this=0x7f1af15c94d0, sd=114, msg=0x7f1aed7f2f20, len=126728, more=false)
    at msg/SimpleMessenger.cc:1994
#2  0x00000000006739d3 in SimpleMessenger::Pipe::write_message (
    this=0x7f1af15c94d0, m=0x23d3010) at msg/SimpleMessenger.cc:2217
#3  0x000000000067e74a in SimpleMessenger::Pipe::writer (this=0x7f1af15c94d0)
    at msg/SimpleMessenger.cc:1734
#4  0x000000000066fa2b in SimpleMessenger::Pipe::Writer::entry (
    this=0x7f1af15c9720) at msg/SimpleMessenger.h:204
#5  0x000000000068282e in Thread::_entry_func (arg=0x7f1af15c9720)
    at ./common/Thread.h:41
#6  0x00007f1b0123173d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#7  0x00007f1b00147f6d in clone () from /lib64/libc.so.6

The third also got

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7f531fefe940 (LWP 28700)]
0x00007f533ffeaf2b in sendmsg () from /lib64/libpthread.so.0
(gdb)

but something was a little different and I didn't get a backtrace from it.

-- Jim

>
> Colin
>
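
For anyone reading along, a note on why a SIGPIPE in sendmsg() would fit the "dies with no core file" symptom: the default action for SIGPIPE is to terminate the process without dumping core. Below is a minimal sketch of that behaviour, not anything taken from the Ceph tree. It assumes Linux (for MSG_NOSIGNAL), and the names CHILD and run() are made up for illustration. The child restores the default SIGPIPE disposition (Python ignores SIGPIPE at startup, unlike a C++ daemon), closes one end of a socketpair, and sends on the other end, once with no flags and once with MSG_NOSIGNAL.

```python
import signal
import subprocess
import sys

# Hypothetical child script: sends on a socket whose peer is already closed.
CHILD = r"""
import signal, socket, sys

# Python ignores SIGPIPE at startup; restore the default action so the
# child behaves like a C/C++ program that never touched the signal.
signal.signal(signal.SIGPIPE, signal.SIG_DFL)

a, b = socket.socketpair()   # connected AF_UNIX stream sockets
b.close()                    # peer goes away, as if the other daemon died

flags = socket.MSG_NOSIGNAL if sys.argv[1] == "nosignal" else 0
try:
    a.send(b"x", flags)      # flags == 0: the kernel raises SIGPIPE here
except BrokenPipeError:      # MSG_NOSIGNAL: we get EPIPE as an error instead
    sys.exit(42)
"""


def run(mode):
    """Run the child in the given mode and return its exit status.

    A negative value means the child was killed by that signal number.
    """
    return subprocess.run([sys.executable, "-c", CHILD, mode]).returncode


if __name__ == "__main__":
    print("plain send:       ", run("default"))
    print("MSG_NOSIGNAL send:", run("nosignal"))
```

With a plain send the child is killed by the signal and leaves nothing behind; with MSG_NOSIGNAL the same failure shows up as an ordinary EPIPE error that the caller can handle. Ignoring SIGPIPE process-wide would have a similar effect. Whether either is the right change for SimpleMessenger is a separate question that this sketch does not answer.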