Hi Josh,

I just sorted this out. The problem was that the encoding for
OSDSuperblock was changed, and that struct was embedded in the MOSDBoot
message. Some of your OSDs restarted before the monitors, so the old
monitors saw the new structure and misdecoded the message, ending up with
garbage (well, zeros) for the heartbeat address. That made it into the
OSDMap, and a very impolite assert in the messenger code made the process
crash when it got an error from socket(2).

The assert and error handling have been cleaned up. There isn't a nice way
to fix the behavior of the old code, though, so for everyone else:
upgrade/restart the monitors before the OSDs to avoid triggering this. If
you do hit it, restarting the OSDs (possibly a couple of times) will clear
it up. Once all of the ':/0' values disappear from 'ceph osd dump' you're
in the clear.

sage

http://tracker.newdream.net/issues/1942


On Sat, 14 Jan 2012, Josh Pieper wrote:
> I just upgraded our test cluster to 0.40, and immediately after
> starting up get asserts in all the OSDs. I've inlined a relevant
> backtrace below, is there anything else that would be useful for
> debugging?
>
> Our test cluster is 3 ubuntu 11.10 amd64 machines, each with a mon and
> osd.
>
> Looking at an strace, it is pretty clearly asking for an invalid
> address family, although I'm not sure where it is coming from.
>
> [pid 30648] socket(PF_UNSPEC, SOCK_STREAM, 0 <unfinished ...>
> [pid 30648] <... socket resumed> ) = -1 EAFNOSUPPORT (Address family not supported by protocol)
>
> -Josh
>
> -------
> 2012-01-14 09:31:03.395266 7f67edf08700 -- 10.1.10.71:6801/27529 >> 10.1.10.73:6801/8127 pipe(0x14e0780 sd=19 pgs=0 cs=0 l=0).connect claims to be 10.1.10.73:6801/24029 not 10.1.10.73:6801/8127 - wrong node!
> 2012-01-14 09:31:03.395579 7f67ede07700 -- :/27530 >> :/0 pipe(0x14e0500 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Address family not supported by protocol
> msg/SimpleMessenger.cc: In function 'int SimpleMessenger::Pipe::connect()', in thread '7f67ede07700'
> msg/SimpleMessenger.cc: 1038: FAILED assert(0)
>  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
>  1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
>  2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
>  3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
>  4: (()+0x7efc) [0x7f67ffdf4efc]
>  5: (clone()+0x6d) [0x7f67fe42589d]
>  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
>  1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
>  2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
>  3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
>  4: (()+0x7efc) [0x7f67ffdf4efc]
>  5: (clone()+0x6d) [0x7f67fe42589d]
> *** Caught signal (Aborted) **
>  in thread 7f67ede07700
>  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
>  1: /usr/bin/ceph-osd() [0x5fd926]
>  2: (()+0x10060) [0x7f67ffdfd060]
>  3: (gsignal()+0x35) [0x7f67fe37a3a5]
>  4: (abort()+0x17b) [0x7f67fe37db0b]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f67fec38d7d]
>  6: (()+0xb9f26) [0x7f67fec36f26]
>  7: (()+0xb9f53) [0x7f67fec36f53]
>  8: (()+0xba04e) [0x7f67fec3704e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x193) [0x5cfd33]
>  10: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
>  11: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
>  12: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
>  13: (()+0x7efc) [0x7f67ffdf4efc]
>  14: (clone()+0x6d) [0x7f67fe42589d]
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
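
For anyone who wants to script the "no more ':/0' values" check from the
reply above, here is a minimal sketch. It only looks for the literal ':/0'
string in whatever text it is given; it assumes nothing about the layout of
'ceph osd dump' output, and the function name is made up for illustration.

```python
import re

# A blank address with a zero nonce, as seen in the log line ":/27530 >> :/0".
# The lookahead keeps a longer nonce such as ":/01234" from matching.
BLANK_ADDR = re.compile(r":/0(?!\d)")


def lines_with_blank_addr(dump_text):
    """Return the lines of dump_text that still contain a ':/0' address."""
    return [line for line in dump_text.splitlines() if BLANK_ADDR.search(line)]
```

One way to use it is to pass in the saved output of 'ceph osd dump' as a
string; an empty list back means no ':/0' values were found in that text.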