I had to run the RocksDB repair tool earlier because the RocksDB files got corrupted for a different reason (possibly another bug). Maybe that is why the mon now crash-loops, although it ran fine for a day.
What is meant by "turn it off and rebuild from the remainder"?
On Saturday, 5 October 2019, 02:03:44 EEST, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Hmm, that assert means the monitor tried to grab an OSDMap it had on
disk but it didn't work. (In particular, a "pinned" full map which we
kept around after trimming the others to save on disk space.)
That *could* be a bug where we didn't have the pinned map and should
have (or incorrectly thought we should have), but this code was in
Mimic as well as Nautilus and I haven't seen similar reports. So it
could also mean that something bad happened to the monitor's disk or
Rocksdb store. Can you turn it off and rebuild from the remainder, or
do they all exhibit this bug?
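
To make the failure mode above concrete, here is a toy Python model of the idea. It is not Ceph code. It assumes, going by the name get_full_from_pinned_map and the log line, that a trimmed full map is rebuilt from the closest pinned full map at or below the requested version plus the incremental maps after it. The class ToyMapStore, its fields, and the sample data are all hypothetical.

```python
import bisect
import errno


class ToyMapStore:
    """Toy stand-in for a monitor store (hypothetical, not the Ceph API).

    Full maps survive only at "pinned" versions; every other version is
    rebuilt from the closest pinned full map plus incremental changes.
    """

    def __init__(self, full_maps, incrementals, pinned):
        self.full_maps = dict(full_maps)        # version -> full map
        self.incrementals = dict(incrementals)  # version -> changes since version - 1
        self.pinned = sorted(pinned)            # versions that should have a full map

    def closest_pinned(self, ver):
        # Largest pinned version that is <= ver.
        i = bisect.bisect_right(self.pinned, ver)
        if i == 0:
            raise LookupError(f"no pinned map at or below version {ver}")
        return self.pinned[i - 1]

    def get_full_from_pinned(self, ver):
        base = self.closest_pinned(ver)
        if base not in self.full_maps:
            # The store believes this version is pinned, but the data is gone.
            raise FileNotFoundError(
                errno.ENOENT, f"closest pinned map ver {base} not available")
        full = dict(self.full_maps[base])
        for v in range(base + 1, ver + 1):
            full.update(self.incrementals[v])   # replay each incremental in order
        return full


# Version 20 is recorded as pinned, but its full map is missing from the store.
store = ToyMapStore(
    full_maps={10: {"osd.0": "up"}},
    incrementals={11: {"osd.1": "up"}, 12: {"osd.0": "down"},
                  21: {"osd.2": "up"}, 22: {"osd.1": "down"}},
    pinned=[10, 20],
)
```

In this model, store.get_full_from_pinned(12) succeeds because pinned version 10 is intact, while store.get_full_from_pinned(22) raises FileNotFoundError with errno 2, which matches the shape of the "(2) No such file or directory" error in the log below.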
On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
<danjou.philippe@xxxxxxxx> wrote:
>
> Hi,
> our mon is acting up all of a sudden and dying in a crash loop with the following:
>
>
> 2019-10-04 14:00:24.339583 lease_expire=0.000000 has v0 lc 4549352
> -3> 2019-10-04 14:00:24.335 7f6e5d461700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.000000 has v0 lc 4549352
> -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map closest pinned map ver 252615 not available! error: (2) No such file or directory
> -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 7f6e5d461700 time 2019-10-04 14:00:24.347580
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
>
> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f6e68eb064e]
> 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f6e68eb0829]
> 3: (OSDMonitor::get_full_from_pinned_map(unsigned long, ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> 4: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> 5: (OSDMonitor::encode_trim_extra(std::shared_ptr<MonitorDBStore::Transaction>, unsigned long)+0x8c) [0x717c3c]
> 6: (PaxosService::maybe_trim()+0x473) [0x707443]
> 7: (Monitor::tick()+0xa9) [0x5ecf39]
> 8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> 9: (Context::complete(int)+0x9) [0x6070d9]
> 10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> 11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> 12: (()+0x76ba) [0x7f6e67cab6ba]
> 13: (clone()+0x6d) [0x7f6e674d441d]
>
> 0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
> in thread 7f6e5d461700 thread_name:safe_timer
>
> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> 1: (()+0x11390) [0x7f6e67cb5390]
> 2: (gsignal()+0x38) [0x7f6e67402428]
> 3: (abort()+0x16a) [0x7f6e6740402a]
> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f6e68eb069f]
> 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f6e68eb0829]
> 6: (OSDMonitor::get_full_from_pinned_map(unsigned long, ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
> 7: (OSDMonitor::get_version_full(unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
> 8: (OSDMonitor::encode_trim_extra(std::shared_ptr<MonitorDBStore::Transaction>, unsigned long)+0x8c) [0x717c3c]
> 9: (PaxosService::maybe_trim()+0x473) [0x707443]
> 10: (Monitor::tick()+0xa9) [0x5ecf39]
> 11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
> 12: (Context::complete(int)+0x9) [0x6070d9]
> 13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
> 14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
> 15: (()+0x76ba) [0x7f6e67cab6ba]
> 16: (clone()+0x6d) [0x7f6e674d441d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> This was running fine for 2 months until now; it's a crashed cluster that is in recovery.
>
> Any suggestions?
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com