One of my mons has been having a rough time for the last day or so. It started with a crash and restart I didn't notice about a day ago and now it won't start. Where it crashes has changed over time but it is now stuck on the last error below. I've tried to get some more information out of it with debug logging and gdb but I haven't seen anything that makes the root cause of this obvious.
Right now it is crashing at line 103 of LogMonitor.cc (https://github.com/ceph/ceph/blob/mimic/src/mon/LogMonitor.cc#L103), which runs as part of the mon preinit step. As best I can tell, it is having a problem with a map version. I'm considering rebuilding the mon's store, though I don't see any clear signs of corruption.
It bails at assert(err == 0);

    // walk through incrementals
    while (version > summary.version) {
      bufferlist bl;
      int err = get_version(summary.version+1, bl);
      assert(err == 0);
      assert(bl.length());

Has anyone seen similar or have any ideas?

ceph 13.2.8

Thanks!
Kevin

The first crash/restart
Jan 14 20:47:11 sephmon5 ceph-mon: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::basic_string<char>, std::basic_string<char> >*, int*)' thread 7f5b54680700 time 2020-01-14 20:47:11.618368
Jan 14 20:47:11 sephmon5 ceph-mon: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc: 5225: FAILED assert(err == 0)
Jan 14 20:47:11 sephmon5 ceph-mon: ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
Jan 14 20:47:11 sephmon5 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f5b6440b87b]
Jan 14 20:47:11 sephmon5 ceph-mon: 2: (()+0x26fa07) [0x7f5b6440ba07]
Jan 14 20:47:11 sephmon5 ceph-mon: 3: (Monitor::_scrub(ScrubResult*, std::pair<std::string, std::string>*, int*)+0xfa6) [0x55c3230a1896]
Jan 14 20:47:11 sephmon5 ceph-mon: 4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x25e) [0x55c3230aa01e]
Jan 14 20:47:11 sephmon5 ceph-mon: 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xcaf) [0x55c3230c73ff]
Jan 14 20:47:11 sephmon5 ceph-mon: 6: (Monitor::_ms_dispatch(Message*)+0x732) [0x55c3230c8152]
Jan 14 20:47:11 sephmon5 ceph-mon: 7: (Monitor::ms_dispatch(Message*)+0x23) [0x55c3230edcc3]
Jan 14 20:47:11 sephmon5 ceph-mon: 8: (DispatchQueue::entry()+0xb7a) [0x7f5b644ca24a]
Jan 14 20:47:11 sephmon5 ceph-mon: 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f5b645684bd]
Jan 14 20:47:11 sephmon5 ceph-mon: 10: (()+0x7e65) [0x7f5b63749e65]
Jan 14 20:47:11 sephmon5 ceph-mon: 11: (clone()+0x6d) [0x7f5b6025d88d]
Then a couple more crashes/restarts about 11 hours later with this trace
-10001> 2020-01-15 09:36:35.796 7f9600fc7700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: In function 'void LogMonitor::_create_sub_incremental(MLog*, int, version_t)' thread 7f9600fc7700 time 2020-01-15 09:36:35.796354
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: 673: FAILED assert(err == 0)
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f9610d5287b]
2: (()+0x26fa07) [0x7f9610d52a07]
3: (LogMonitor::_create_sub_incremental(MLog*, int, unsigned long)+0xb54) [0x55aeb09e2f94]
4: (LogMonitor::check_sub(Subscription*)+0x506) [0x55aeb09e3806]
5: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x10ed) [0x55aeb098973d]
6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x3cd) [0x55aeb09b0b1d]
7: (Monitor::_ms_dispatch(Message*)+0x732) [0x55aeb09b2152]
8: (Monitor::ms_dispatch(Message*)+0x23) [0x55aeb09d7cc3]
9: (DispatchQueue::entry()+0xb7a) [0x7f9610e1124a]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9610eaf4bd]
11: (()+0x7e65) [0x7f9610090e65]
12: (clone()+0x6d) [0x7f960cba488d]
-10001> 2020-01-15 09:36:35.797 7f95fffc5700 1 -- 10.1.9.205:6789/0 >> - conn(0x55aec5dd0600 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=47 -
-10001> 2020-01-15 09:36:35.798 7f9600fc7700 -1 *** Caught signal (Aborted) **
in thread 7f9600fc7700 thread_name:ms_dispatch
And now the mon no longer starts with this trace
-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).paxosservice(logm 0..86521000) refresh
-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log v86521000 update_from_paxos
-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log v86521000 update_from_paxos version 86521000 summary v 0
-261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log v86521000 update_from_paxos latest full 86520999
-261> 2020-01-15 16:36:46.084 7f0946674a00 7 mon.sephmon5@-1(probing).log v86521000 update_from_paxos loading summary e86520999
-261> 2020-01-15 16:36:46.084 7f0946674a00 7 mon.sephmon5@-1(probing).log v86521000 update_from_paxos loaded summary e86520999
-261> 2020-01-15 16:36:46.085 7f0946674a00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: In function 'virtual void LogMonitor::update_from_paxos(bool*)' thread 7f0946674a00 time 2020-01-15 16:36:46.084573
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc: 103: FAILED assert(err == 0)
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f093da5087b]
2: (()+0x26fa07) [0x7f093da50a07]
3: (LogMonitor::update_from_paxos(bool*)+0x19d3) [0x55e5a0378de3]
4: (PaxosService::refresh(bool*)+0x22b) [0x55e5a0427e1b]
5: (Monitor::refresh_from_paxos(bool*)+0xd3) [0x55e5a0325943]
6: (Monitor::preinit()+0xac4) [0x55e5a0326784]
7: (main()+0x2611) [0x55e5a021b2b1]
8: (__libc_start_main()+0xf5) [0x7f09397c6505]
9: (()+0x24ad40) [0x55e5a02fad40]
-261> 2020-01-15 16:36:46.086 7f0946674a00 -1 *** Caught signal (Aborted) **
in thread 7f0946674a00 thread_name:ceph-mon
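
One way to read the last few debug lines against the loop quoted above is sketched here. This is not the Ceph code: FakeStore, the free function get_version, and first_unreadable are hypothetical stand-ins, and the sketch assumes that loading the summary at e86520999 leaves exactly one incremental (86521000) still to apply. Under that assumption the loop would ask the store for version 86521000, and a failed read of that key is what would trip assert(err == 0).

```cpp
#include <cstdint>
#include <map>
#include <string>

using version_t = std::uint64_t;

// Hypothetical stand-in for the mon store's log service keys:
// version number -> encoded incremental.
using FakeStore = std::map<version_t, std::string>;

// Mimics only the return convention of the quoted get_version() call:
// 0 on success, a negative errno-style value when the key is absent.
int get_version(const FakeStore& store, version_t v, std::string& out) {
  auto it = store.find(v);
  if (it == store.end()) {
    return -2;  // stands in for -ENOENT
  }
  out = it->second;
  return 0;
}

// Walks incrementals from summary_version + 1 up to latest, the way the
// quoted loop does, but reports the first unreadable or empty version
// through `bad` instead of asserting. Returns true if such a version exists.
bool first_unreadable(const FakeStore& store, version_t summary_version,
                      version_t latest, version_t& bad) {
  while (latest > summary_version) {
    std::string bl;
    int err = get_version(store, summary_version + 1, bl);
    if (err != 0 || bl.empty()) {
      bad = summary_version + 1;
      return true;
    }
    // Assumption: applying an incremental advances the summary by one.
    ++summary_version;
  }
  return false;
}
```

With an empty FakeStore, first_unreadable(store, 86520999, 86521000, bad) returns true and sets bad to 86521000, which is the version the real loop would request first given the "loaded summary e86520999" line. It says nothing about why the read fails in the real store.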
--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison