Re: 0.40 OSD - Address family not supported by protocol

Sage,

Thanks for sorting out the root cause!

-Josh

Sage Weil wrote:
> Hi Josh,
> 
> I just sorted this out.  The problem was that the encoding for 
> OSDSuperblock was changed, and that struct was embedded in the MOSDBoot 
> message.  Some of your OSDs restarted before the monitors, so the old 
> monitors saw the new structure and misdecoded the message with garbage 
> (well, zeros) for the heartbeat address.  This made it into the OSDMap, 
> and a very impolite assert in the messenger code made the process crash 
> when it got an error from socket(2).
> 
> The assert and error handling have been cleaned up.  There isn't a nice 
> way to fix the behavior of the old code, though, so for everyone else: 
> upgrade/restart the monitors before the OSDs to avoid triggering this.  If 
> you do trigger it, restarting the OSDs (possibly a couple of times) will 
> clear it up.  
> Once all of the ':/0' values disappear from 'ceph osd dump' you're in the 
> clear.
> 
> sage
> 
> 
> http://tracker.newdream.net/issues/1942
> 
> On Sat, 14 Jan 2012, Josh Pieper wrote:
> 
> > I just upgraded our test cluster to 0.40, and immediately after
> > starting up I get asserts in all the OSDs.  I've inlined a relevant
> > backtrace below; is there anything else that would be useful for
> > debugging?
> > 
> > Our test cluster is 3 ubuntu 11.10 amd64 machines, each with a mon and
> > osd.
> > 
> > Looking at an strace, it is pretty clearly asking for an invalid
> > address family, although I'm not sure where it is coming from.
> > 
> > [pid 30648] socket(PF_UNSPEC, SOCK_STREAM, 0 <unfinished ...>
> > [pid 30648] <... socket resumed> )      = -1 EAFNOSUPPORT (Address family not supported by protocol)
> > 
> > -Josh
> > 
> > -------
> > 2012-01-14 09:31:03.395266 7f67edf08700 -- 10.1.10.71:6801/27529 >> 10.1.10.73:6801/8127 pipe(0x14e0780 sd=19 pgs=0 cs=0 l=0).connect claims to be 10.1.10.73:6801/24029 not 10.1.10.73:6801/8127 - wrong node!
> > 2012-01-14 09:31:03.395579 7f67ede07700 -- :/27530 >> :/0 pipe(0x14e0500 sd=-1 pgs=0 cs=0 l=0).connect couldn't created socket Address family not supported by protocol
> > msg/SimpleMessenger.cc: In function 'int SimpleMessenger::Pipe::connect()', in thread '7f67ede07700'
> > msg/SimpleMessenger.cc: 1038: FAILED assert(0)
> >  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> >  1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> >  2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> >  3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> >  4: (()+0x7efc) [0x7f67ffdf4efc]
> >  5: (clone()+0x6d) [0x7f67fe42589d]
> >  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> >  1: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> >  2: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> >  3: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> >  4: (()+0x7efc) [0x7f67ffdf4efc]
> >  5: (clone()+0x6d) [0x7f67fe42589d]
> > *** Caught signal (Aborted) **
> >  in thread 7f67ede07700
> >  ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
> >  1: /usr/bin/ceph-osd() [0x5fd926]
> >  2: (()+0x10060) [0x7f67ffdfd060]
> >  3: (gsignal()+0x35) [0x7f67fe37a3a5]
> >  4: (abort()+0x17b) [0x7f67fe37db0b]
> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f67fec38d7d]
> >  6: (()+0xb9f26) [0x7f67fec36f26]
> >  7: (()+0xb9f53) [0x7f67fec36f53]
> >  8: (()+0xba04e) [0x7f67fec3704e]
> >  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x193) [0x5cfd33]
> >  10: (SimpleMessenger::Pipe::connect()+0x87c) [0x5be81c]
> >  11: (SimpleMessenger::Pipe::writer()+0x456) [0x5c1536]
> >  12: (SimpleMessenger::Pipe::Writer::entry()+0xd) [0x4b228d]
> >  13: (()+0x7efc) [0x7f67ffdf4efc]
> >  14: (clone()+0x6d) [0x7f67fe42589d]
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
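
For anyone who wants to script the check Sage describes above (waiting for the ':/0' values to disappear from 'ceph osd dump'), here is a minimal sketch. It assumes each OSD appears on one line of the dump and that a blank address shows up as a standalone ':/0' token; the function name and the sample lines are hypothetical, and real 0.40 output may be laid out differently.

```python
def find_blank_addrs(osd_dump_text):
    """Return the lines of 'ceph osd dump' text that contain a blank ':/0' address.

    Assumption: addresses are whitespace-separated tokens, and a blank
    (misdecoded) address is printed exactly as ':/0'.
    """
    bad = []
    for line in osd_dump_text.splitlines():
        # Compare whole tokens so that a valid 'host:port/0' is not flagged.
        if ":/0" in line.split():
            bad.append(line.strip())
    return bad


if __name__ == "__main__":
    # Hypothetical sample, not real cluster output.
    sample = (
        "osd.0 up in 10.1.10.71:6800/27529 10.1.10.71:6801/27529\n"
        "osd.1 up in 10.1.10.72:6800/100 :/0\n"
    )
    for line in find_blank_addrs(sample):
        print("still blank:", line)
```

An empty result would mean no OSD entry still carries a blank address, under the format assumption above.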

-- 
Shaw's Principle:
	Build a system that even a fool can use, and only a fool will
	want to use it.
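
The failure in the strace quoted above (socket() called with PF_UNSPEC) can be reproduced outside Ceph. The sketch below is not the actual messenger fix; it only illustrates returning the error to the caller instead of asserting. The helper name is hypothetical.

```python
import socket


def try_create_socket(family):
    """Try to create a TCP socket for the given address family.

    Returns (sock, None) on success or (None, errno_code) on failure,
    so the caller can decide what to do instead of crashing.
    """
    try:
        return socket.socket(family, socket.SOCK_STREAM), None
    except OSError as e:
        return None, e.errno


if __name__ == "__main__":
    # AF_UNSPEC corresponds to the zeroed address family in the bad OSDMap entry.
    # On Linux the errno here is expected to be EAFNOSUPPORT, as in the strace.
    sock, err = try_create_socket(socket.AF_UNSPEC)
    if sock is None:
        print("socket creation failed, errno", err)
    else:
        sock.close()
```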