Re: OSD Crash When Upgrading from Jewel to Luminous?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Fri, 17 Aug 2018 17:01:09 -0400

Do you have more logs that indicate what state machine event the crashing OSDs received? This obviously shouldn't have happened, but it's a plausible failure mode, especially if it's a relatively rare combination of events.
-Greg
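One way to dig the relevant lines out of an OSD log is sketched below. It assumes the recent-events dump in the log uses the same layout as the one pasted line in the quoted message ("<index>> <date> <time> <thread id> <level> <message>", with index 0 being the assert). The function name and the dictionary keys are made up for illustration. It keeps only the entries logged by the thread that hit the assert, which is where the offending peering event would most likely show up if the debug level was high enough to record it.

```python
import re

# Assumed layout, taken from the single dump line in the quoted message:
#   "<index>> <date> <time> <thread id> <level> <message>"
EVENT_RE = re.compile(
    r"^\s*(-?\d+)>\s+(\S+ \S+)\s+([0-9a-f]+)\s+(-?\d+)\s+(.*)$"
)


def events_for_crashing_thread(lines):
    """Return the dump entries logged by the thread that owns entry 0.

    `lines` is any iterable of log lines. Lines that do not match the
    assumed layout (for example backtrace frames) are skipped.
    """
    events = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            events.append({
                "index": int(m.group(1)),
                "timestamp": m.group(2),
                "thread": m.group(3),
                "message": m.group(5),
            })

    # Entry 0 is the most recent event, i.e. the failed assert.
    crash = [e for e in events if e["index"] == 0]
    if not crash:
        return []
    thread_id = crash[-1]["thread"]
    return [e for e in events if e["thread"] == thread_id]
```

Feeding it the lines of one crashed OSD's log (for example `events_for_crashing_thread(open(path))`, with `path` pointing at that log) yields the entries in file order, ending with the assert itself.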

On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne <kvanalstyne@xxxxxxxxxxxxxxx> wrote:
Hello all:

I ran into an issue recently with one of my clusters when upgrading from 10.2.10 to 12.2.7.  I had previously tested the upgrade in a lab and upgraded one of our five production clusters with no issues.  On the second cluster, however, every OSD that was NOT yet running Luminous (about 40% of the cluster at the time) crashed with the same backtrace, which I have pasted below:

===

     0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f145c9ec700 time 2018-08-13 17:35:13.157319
osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x55b9bf08614f]
 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
 3: (()+0x447366) [0x55b9bea9a366]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2f7) [0x55b9beac8b77]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b9beaab5bb]
 6: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
 7: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
 8: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
 11: (()+0x7507) [0x7f14e2c96507]
 12: (clone()+0x3f) [0x7f14e0ca214f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

===
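The backtrace itself already says which recovery state handled the event: frame 4 is the `react_impl` of `simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, ...>`, so the PG appears to have been in Started/Stray when the unexpected event arrived. A minimal sketch for pulling that out of many crash logs at once follows. The helper name is made up, and it assumes the frame text looks like the one pasted above.

```python
import re

# Matches the first two template arguments of the simple_state<> frame:
# the state that reacted and its enclosing (context) state.
STATE_RE = re.compile(
    r"simple_state<PG::RecoveryState::(\w+),\s*PG::RecoveryState::(\w+)"
)


def reacting_state(backtrace_lines):
    """Return (state, parent_state) from the first react_impl frame, or None."""
    for line in backtrace_lines:
        if "react_impl" not in line:
            continue
        m = STATE_RE.search(line)
        if m:
            return m.group(1), m.group(2)
    return None


# Frame 4 from the backtrace above, abbreviated for readability.
frame = (
    " 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, "
    "PG::RecoveryState::Started, ...>::react_impl("
    "boost::statechart::event_base const&, void const*)+0x2f7)"
)
print(reacting_state([frame]))
```

This only identifies the state, not the event type; the event is passed as a generic `event_base`, so it does not appear in the symbol names.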

Once I restarted the impacted OSDs, which brought them up to 12.2.7, everything recovered just fine and the cluster is healthy.  The only rub is that losing that many OSDs simultaneously caused a significant I/O disruption to the production servers for several minutes while I brought up the remaining OSDs.  I have been trying to reproduce this issue in a lab before continuing the upgrades on the other three clusters, but am coming up short.  Has anyone seen anything like this, or am I missing something obvious?

Given how quickly the issue happened and the fact that I’m having a hard time reproducing this issue, I am limited in the amount of logging and debug information I have available, unfortunately.  If it helps, all ceph-mon, ceph-mds, radosgw, and ceph-mgr daemons were running 12.2.7, while 30 of the 50 total ceph-osd daemons were also on 12.2.7 when the remaining 20 ceph-osd daemons (on 10.2.10) crashed.
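For anyone comparing their own mid-upgrade state against the numbers above, here is a small sketch that summarizes a per-version OSD count. The function name and dictionary layout are made up for illustration; the counts are the ones quoted in this message. On a cluster whose monitors are already on Luminous, the counts could presumably be taken from the output of `ceph versions`, though that is an assumption about your setup and not something captured at the time of the crash.

```python
def summarize_osd_versions(counts):
    """Summarize a {version: number of OSDs} mapping.

    Returns the total, each version's share of the total, and whether
    more than one version is present (a mixed-version cluster).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no OSDs counted")
    shares = {version: n / total for version, n in counts.items()}
    return {"total": total, "shares": shares, "mixed": len(counts) > 1}


# Counts from the message: 30 OSDs on 12.2.7, 20 still on 10.2.10.
summary = summarize_osd_versions({"12.2.7": 30, "10.2.10": 20})
for version, share in summary["shares"].items():
    print(f"{version}: {share:.0%}")
```

With these counts the 10.2.10 share works out to 20/50 = 40%, which matches the "about 40%" figure given earlier.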

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
