Hi,

More and more of my OSDs now crash all the time, and I've lost more OSDs than my replication allows, so all my data is currently down or inactive.

Can somebody help me fix these asserts and get the OSDs up again (so I can start my disaster recovery backup)?

$ sudo /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
2022-11-02T22:02:43.482+0100 ffffb8198040 -1 Falling back to public interface
2022-11-02T22:03:33.301+0100 ffffb8198040 -1 osd.10 30473 log_to_monitors true
2022-11-02T22:03:34.484+0100 ffffabdcbb00 -1 osd.10 30473 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
/mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: In function 'void OSD::do_recovery(PG*, epoch_t, uint64_t, ThreadPool::TPHandle&)' thread ffff9733bb00 time 2022-11-02T22:03:37.276509+0100
/mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: 9676: FAILED ceph_assert(started <= reserved_pushes)

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x134) [0xaaaabcf5f74c]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 8: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 9: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]

2022-11-02T22:03:37.280+0100 ffff9733bb00 -1 /mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: In function 'void OSD::do_recovery(PG*, epoch_t, uint64_t, ThreadPool::TPHandle&)' thread ffff9733bb00 time 2022-11-02T22:03:37.276509+0100
/mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: 9676: FAILED ceph_assert(started <= reserved_pushes)

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x134) [0xaaaabcf5f74c]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 8: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 9: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]

*** Caught signal (Aborted) **
 in thread ffff9733bb00 thread_name:tp_osd_tp

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: __kernel_rt_sigreturn()
 2: /usr/lib/libc.so.6(+0x82790) [0xffffb6cc2790]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0xaaaabcf5f7a0]
 6: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 12: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 13: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]

2022-11-02T22:03:37.284+0100 ffff9733bb00 -1 *** Caught signal (Aborted) **
 in thread ffff9733bb00 thread_name:tp_osd_tp

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: __kernel_rt_sigreturn()
 2: /usr/lib/libc.so.6(+0x82790) [0xffffb6cc2790]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0xaaaabcf5f7a0]
 6: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 12: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 13: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 -9999> 2022-11-02T22:03:34.484+0100 ffffabdcbb00 -1 osd.10 30473 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
 -9998> 2022-11-02T22:03:37.280+0100 ffff9733bb00 -1 /mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: In function 'void OSD::do_recovery(PG*, epoch_t, uint64_t, ThreadPool::TPHandle&)' thread ffff9733bb00 time 2022-11-02T22:03:37.276509+0100
/mnt/ceph/src/ceph-17.2.4/src/osd/OSD.cc: 9676: FAILED ceph_assert(started <= reserved_pushes)

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x134) [0xaaaabcf5f74c]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 8: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 9: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]

 -9997> 2022-11-02T22:03:37.284+0100 ffff9733bb00 -1 *** Caught signal (Aborted) **
 in thread ffff9733bb00 thread_name:tp_osd_tp

 ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable)
 1: __kernel_rt_sigreturn()
 2: /usr/lib/libc.so.6(+0x82790) [0xffffb6cc2790]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0xaaaabcf5f7a0]
 6: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xaaaabcf5f8c8]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x4f4) [0xaaaabcfee554]
 8: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x28) [0xaaaabd276398]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x574) [0xaaaabcfeebb4]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x308) [0xaaaabd6687e8]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x18) [0xaaaabd66afe8]
 12: /usr/lib/libc.so.6(+0x80aec) [0xffffb6cc0aec]
 13: /usr/lib/libc.so.6(+0xea5dc) [0xffffb6d2a5dc]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx