Hi all,

I am running Ceph v0.30 with 31 OSDs on Linux 2.6.37. After I set up the whole cluster, many (10) OSDs went down because their cosd processes were killed. The OSD log is in the attached file "osd-failed". The same thing happened once a week ago. That time we fixed it by simply rebuilding the cluster, but this time we do not want to use that method; we want to find out what causes the failure.

Why does the SimpleMessenger keep sending RESETSESSION? What causes the boost::statechart recovery state machine to fail? Could you give us some constructive advice?

Thanks in advance.

2011-07-11 16:22:47.580065 4eafb950 -- 192.168.0.118:6800/26500 >> 192.168.0.212:0/1997725251 pipe(0xe684780 sd=16 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.212:0/1997725251 (socket is 192.168.0.212:41488/0)
2011-07-11 16:22:47.590243 50b1b950 -- 192.168.0.118:6800/26500 >> 192.168.0.206:0/1951212382 pipe(0xe684000 sd=17 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.206:0/1951212382 (socket is 192.168.0.206:51650/0)
2011-07-11 16:22:47.591856 4eeff950 -- 192.168.0.118:6800/26500 >> 192.168.0.213:0/2033517006 pipe(0xad7b280 sd=18 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.213:0/2033517006 (socket is 192.168.0.213:52987/0)
2011-07-11 16:22:47.600186 57989950 -- 192.168.0.118:6800/26500 >> 192.168.0.214:0/10984431 pipe(0xad7b500 sd=19 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.214:0/10984431 (socket is 192.168.0.214:35701/0)
2011-07-11 16:22:47.835041 5afbf950 -- 192.168.0.118:6802/26500 >> 192.168.0.109:6805/14310 pipe(0x2c4b000 sd=20 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-07-11 16:22:48.819890 4eeff950 -- 192.168.0.118:6801/26500 >> 192.168.0.109:6804/14310 pipe(0x2c4bc80 sd=12 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-07-11 16:22:49.167883 4e8f9950 -- 192.168.0.118:6802/26500 >> 192.168.0.106:6805/14457 pipe(0x2c4b780 sd=14 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-07-11 16:22:49.179942 57989950 -- 192.168.0.118:6802/26500 >> 192.168.0.105:6805/10603 pipe(0xe726a00 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-07-11 16:22:49.526557 58090950 -- 192.168.0.118:6802/26500 >> 192.168.0.106:6802/14367 pipe(0xe726280 sd=18 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-07-11 16:22:50.280940 4eeff950 -- 192.168.0.118:6801/26500 >> 192.168.0.109:6804/14310 pipe(0x2c4bc80 sd=12 pgs=1162 cs=1 l=0).fault with nothing to send, going to standby
2011-07-11 16:22:50.353199 5afbf950 -- 192.168.0.118:6800/26500 >> 192.168.0.207:0/1569463766 pipe(0xad7b780 sd=20 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.207:0/1569463766 (socket is 192.168.0.207:59047/0)
2011-07-11 16:22:50.353827 56676950 -- 192.168.0.118:6800/26500 >> 192.168.0.210:0/2923411330 pipe(0xad7bc80 sd=48 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.210:0/2923411330 (socket is 192.168.0.210:58943/0)
2011-07-11 16:22:50.356753 56f7f950 -- 192.168.0.118:6800/26500 >> 192.168.0.213:0/2033517006 pipe(0xf051280 sd=50 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.213:0/2033517006 (socket is 192.168.0.213:52989/0)
2011-07-11 16:22:50.359422 56e7e950 -- 192.168.0.118:6800/26500 >> 192.168.0.214:0/10984431 pipe(0xf051000 sd=49 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.214:0/10984431 (socket is 192.168.0.214:35703/0)
2011-07-11 16:22:50.360825 57585950 -- 192.168.0.118:6800/26500 >> 192.168.0.211:0/1980302047 pipe(0xacfc780 sd=51 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.211:0/1980302047 (socket is 192.168.0.211:42706/0)
2011-07-11 16:22:50.361602 57e8e950 -- 192.168.0.118:6800/26500 >> 192.168.0.212:0/1997725251 pipe(0xacfc500 sd=52 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.212:0/1997725251 (socket is 192.168.0.212:41489/0)
2011-07-11 16:22:50.387648 58696950 -- 192.168.0.118:6800/26500 >> 192.168.0.208:0/115956664 pipe(0xacfc000 sd=53 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.208:0/115956664 (socket is 192.168.0.208:57194/0)
2011-07-11 16:22:50.434862 58b9b950 -- 192.168.0.118:6800/26500 >> 192.168.0.206:0/1951212382 pipe(0xacfc280 sd=54 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.206:0/1951212382 (socket is 192.168.0.206:51652/0)
2011-07-11 16:22:50.445896 58f9f950 -- 192.168.0.118:6800/26500 >> 192.168.0.209:0/300110026 pipe(0xacfcc80 sd=55 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.209:0/300110026 (socket is 192.168.0.209:38556/0)
2011-07-11 16:22:50.999712 57787950 -- 192.168.0.118:6801/26500 >> 192.168.0.109:6804/14310 pipe(0x2c4bc80 sd=12 pgs=1162 cs=2 l=0).connect got RESETSESSION
2011-07-11 16:22:52.946266 4b6f1950 log [INF] : 1.4c7 scrub ok
osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0u>::my_context)', in thread '0x49cec950'
osd/PG.cc: 3882: FAILED assert(0 == "we got a bad state machine event")
 1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xb6) [0x562116]
 2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x59fb86]
 3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x5a0448]
 4: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xf9) [0x5a2e49]
 5: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x99) [0x5a3f19]
 6: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x9e) [0x5a516e]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x5a1dab]
 8: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x14a) [0x577c0a]
 9: (OSD::handle_pg_log(MOSDPGLog*)+0x344) [0x51a064]
 10: (OSD::_dispatch(Message*)+0x4ed) [0x5232ad]
 11: (OSD::ms_dispatch(Message*)+0xd9) [0x523cf9]
 12: (SimpleMessenger::dispatch_entry()+0x8e3) [0x6175f3]
 13: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49140c]
 14: /lib/libpthread.so.0 [0x7fae8e8a0fc7]
 15: (clone()+0x6d) [0x7fae8d51164d]
 1: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xb6) [0x562116]
 2: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x59fb86]
 3: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x5a0448]
 4: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xf9) [0x5a2e49]
 5: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x99) [0x5a3f19]
 6: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x9e) [0x5a516e]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x5a1dab]
 8: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x14a) [0x577c0a]
 9: (OSD::handle_pg_log(MOSDPGLog*)+0x344) [0x51a064]
 10: (OSD::_dispatch(Message*)+0x4ed) [0x5232ad]
 11: (OSD::ms_dispatch(Message*)+0xd9) [0x523cf9]
 12: (SimpleMessenger::dispatch_entry()+0x8e3) [0x6175f3]
 13: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49140c]
 14: /lib/libpthread.so.0 [0x7fae8e8a0fc7]
 15: (clone()+0x6d) [0x7fae8d51164d]
*** Caught signal (Aborted) ** in thread 0x49cec950
 1: /bsd/bin/cosd [0x63dce2]
 2: /lib/libpthread.so.0 [0x7fae8e8a8a80]
 3: (gsignal()+0x35) [0x7fae8d473ed5]
 4: (abort()+0x183) [0x7fae8d4753f3]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fae8dcfbdc5]
 6: /usr/lib/libstdc++.so.6 [0x7fae8dcfa166]
 7: /usr/lib/libstdc++.so.6 [0x7fae8dcfa193]
 8: /usr/lib/libstdc++.so.6 [0x7fae8dcfa28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x37d) [0x6067dd]
 10: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xb6) [0x562116]
 11: (boost::statechart::detail::inner_constructor<boost::mpl::l_item<mpl_::long_<1l>, PG::RecoveryState::Crashed, boost::mpl::l_end>, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator> >::construct(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>* const&, boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>&)+0x26) [0x59fb86]
 12: (boost::statechart::simple_state<PG::RecoveryState::Started, PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Start, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x5a0448]
 13: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xf9) [0x5a2e49]
 14: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x99) [0x5a3f19]
 15: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x9e) [0x5a516e]
 16: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x5a1dab]
 17: (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x14a) [0x577c0a]
 18: (OSD::handle_pg_log(MOSDPGLog*)+0x344) [0x51a064]
 19: (OSD::_dispatch(Message*)+0x4ed) [0x5232ad]
 20: (OSD::ms_dispatch(Message*)+0xd9) [0x523cf9]
 21: (SimpleMessenger::dispatch_entry()+0x8e3) [0x6175f3]
 22: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x49140c]
 23: /lib/libpthread.so.0 [0x7fae8e8a0fc7]
 24: (clone()+0x6d) [0x7fae8d51164d]
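
To compare the logs from all ten failed OSDs, a small script along these lines might help. It is only a rough sketch, not a Ceph tool: it assumes the raw log file has one entry per line, and the names `summarize`, `RESET_RE`, `ASSERT_RE` and `SAMPLE` are made up for illustration. It counts how often RESETSESSION was sent to each peer and collects any "FAILED assert" lines, which is the information the two questions above are about.

```python
import re
from collections import Counter

# Messenger line: "... >> <peer addr> pipe(...).accept we reset (...), sending RESETSESSION"
RESET_RE = re.compile(r">> (\S+) pipe\(.*sending RESETSESSION")
# Assert line: "<file>: <line>: FAILED assert(<condition>)"
ASSERT_RE = re.compile(r"^(\S+): (\d+): FAILED assert\((.*)\)$")


def summarize(lines):
    """Return (RESETSESSION count per peer, list of failed asserts)."""
    resets = Counter()
    asserts = []
    for line in lines:
        line = line.strip()
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = ASSERT_RE.match(line)
        if m:
            asserts.append((m.group(1), int(m.group(2)), m.group(3)))
    return resets, asserts


# A few entries copied from the log above, one per line.
SAMPLE = [
    "2011-07-11 16:22:47.835041 5afbf950 -- 192.168.0.118:6802/26500 >> 192.168.0.109:6805/14310 "
    "pipe(0x2c4b000 sd=20 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION",
    "2011-07-11 16:22:49.167883 4e8f9950 -- 192.168.0.118:6802/26500 >> 192.168.0.106:6805/14457 "
    "pipe(0x2c4b780 sd=14 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION",
    "2011-07-11 16:22:52.946266 4b6f1950 log [INF] : 1.4c7 scrub ok",
    'osd/PG.cc: 3882: FAILED assert(0 == "we got a bad state machine event")',
]

resets, asserts = summarize(SAMPLE)
for peer, count in sorted(resets.items()):
    print(peer, count)
for source_file, line_no, condition in asserts:
    print(source_file, line_no, condition)
```

To run it on a real log, the same function can be fed an open file, for example `summarize(open(path))`, where `path` is whichever OSD log is being inspected.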