Hi,

We are facing an error where an OSD crashes after a reboot of the server it runs on. We rebooted the servers in our Ceph cluster for patching, and after the reboot two OSDs were crashing. One of them eventually recovered, but the other is still down. The cluster is currently rebalancing objects:

# ceph status
  cluster:
    id:     62d303dc-e46b-4863-93b3-7ee995594dd1
    health: HEALTH_ERR
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            mons ac,ae,v are low on available space
            4/1100134 objects unfound (0.000%)
            1 osds down
            1 host (1 osds) down
            1 nearfull osd(s)
            Reduced data availability: 9 pgs inactive, 8 pgs down, 1 pg incomplete
            Possible data damage: 4 pgs recovery_unfound
            Degraded data redundancy: 12/2204629 objects degraded (0.001%), 5 pgs degraded, 14 pgs undersized
            13 pool(s) nearfull
            236 daemons have recently crashed

  services:
    mon: 3 daemons, quorum v,ac,ae (age 13h)
    mgr: a(active, since 24m)
    mds: myfs:0/1 2 up:standby, 1 damaged
    osd: 7 osds: 6 up (since 68s), 7 in (since 18m); 131 remapped pgs
    rgw: 1 daemon active (harbor.object.store.a)

  task status:

  data:
    pools:   13 pools, 337 pgs
    objects: 1.10M objects, 2.3 TiB
    usage:   5.6 TiB used, 3.0 TiB / 8.5 TiB avail
    pgs:     2.671% pgs not active
             12/2204629 objects degraded (0.001%)
             616501/2204629 objects misplaced (27.964%)
             4/1100134 objects unfound (0.000%)
             179 active+clean
             78  active+clean+remapped
             52  active+remapped+backfill_wait
             13  active+undersized
             8   down
             4   active+recovery_unfound+degraded
             1   active+undersized+degraded
             1   incomplete
             1   active+remapped+backfilling

  io:
    client:   8.0 KiB/s wr, 0 op/s rd, 0 op/s wr
    recovery: 1.4 MiB/s, 7 objects/s

These are the last lines from the OSD crash log. We are not sure why it is crashing :(

4(1) r=-1 lpr=69148 pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] exit Started/Stray 1.009572 7 0.001960
  -19> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs [69149,69149], i have 69149, src has [65617,69149]
  -18> 2023-07-15T06:58:40.723+0000 7f444109e700 5 osd.0 pg_epoch: 69149 pg[10.14( v 9188'79 (0'0,9188'79] local-lis/les=69148/69149 n=4 ec=8825/8825 lis/c=69148/69143 les/c/f=69149/69144/0 sis=69148) [0,3] r=0 lpr=69148 pi=[66972,69148)/5 crt=9188'79 lcod 0'0 mlcod 0'0 active mbc={}] enter Started/Primary/Active/Clean
  -17> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v 7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77 lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148 pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter Started/ReplicaActive
  -16> 2023-07-15T06:58:40.723+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v 7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77 lis/c=69144/69144 les/c/f=69145/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148 pi=[69137,69148)/2 crt=7010'401 mlcod 0'0 remapped NOTIFY m=276 mbc={}] enter Started/ReplicaActive/RepNotRecovering
  -15> 2023-07-15T06:58:40.723+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs [69149,69149], i have 69149, src has [65617,69149]
  -14> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v 9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.001805 4 0.000066
  -13> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v 9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
  -12> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v 9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000043 1 0.000053
  -11> 2023-07-15T06:58:40.724+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[10.1f( v 9092'7 lc 8831'4 (0'0,9092'7] local-lis/les=69148/69149 n=2 ec=8825/8825 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [6,0] r=1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=9092'7 lcod 0'0 mlcod 0'0 active m=2 mbc={}] enter Started/ReplicaActive/RepRecovering
  -10> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v 268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147 pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.010921 6 0.000077
   -9> 2023-07-15T06:58:40.725+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[7.2( v 268'5 (79'3,268'5] lb MIN local-lis/les=69140/69141 n=0 ec=71/68 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,2]/[2,5] r=-1 lpr=69147 pi=[69137,69147)/1 luod=0'0 crt=268'5 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
   -8> 2023-07-15T06:58:40.726+0000 7f444a0b0700 3 osd.0 69149 handle_osd_map epochs [69149,69149], i have 69149, src has [65617,69149]
   -7> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v 65902'349780 (65642'347213,65902'349780] lb 12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147 pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepRecovering 0.996833 5 0.000101
   -6> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[12.c( v 65902'349780 (65642'347213,65902'349780] lb 12:30005f02:::1000060a6a4.00000000:head local-lis/les=67222/67223 n=1903 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [0,5]/[5,1] r=-1 lpr=69147 pi=[67170,69147)/2 luod=0'0 crt=65902'349780 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepNotRecovering
   -5> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v 7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.002781 3 0.000076
   -4> 2023-07-15T06:58:40.726+0000 7f444009c700 5 osd.0 pg_epoch: 69149 pg[8.7s0( v 7010'401 lc 0'0 (0'0,7010'401] local-lis/les=0/0 n=151 ec=77/77 lis/c=69148/69144 les/c/f=69149/69145/0 sis=69148) [0,4,6]/[NONE,4,6]p4(1) r=-1 lpr=69148 pi=[69137,69148)/2 luod=0'0 crt=7010'401 mlcod 0'0 active+remapped m=276 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
   -3> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v 65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171 n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1 lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.009973 6 0.000367
   -2> 2023-07-15T06:58:40.727+0000 7f444089d700 5 osd.0 pg_epoch: 69149 pg[12.1( v 65904'388229 (65626'385027,65904'388229] lb MIN local-lis/les=67170/67171 n=1876 ec=9097/9097 lis/c=69147/69143 les/c/f=69148/69144/0 sis=69147) [5,0]/[5,3] r=-1 lpr=69147 pi=[67170,69147)/3 luod=0'0 crt=65904'388229 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
   -1> 2023-07-15T06:58:40.729+0000 7f444109e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f444109e700 time 2023-07-15T06:58:40.727106+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/osd/PGLog.cc: 369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x563620b4dbd8]
 2: (()+0x507df2) [0x563620b4ddf2]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
 4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x75) [0x563620e982c5]
 5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x563620efefb5]
 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x563620cf22ab]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x563620ce48a1]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563620c4e92f]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
 14: (()+0x814a) [0x7f446112c14a]
 15: (clone()+0x43) [0x7f445fe63f23]

    0> 2023-07-15T06:58:40.733+0000 7f444109e700 -1 *** Caught signal (Aborted) **
 in thread 7f444109e700 thread_name:tp_osd_tp

 ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
 1: (()+0x12b20) [0x7f4461136b20]
 2: (gsignal()+0x10f) [0x7f445fd9e7ff]
 3: (abort()+0x127) [0x7f445fd88c35]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x563620b4dc29]
 5: (()+0x507df2) [0x563620b4ddf2]
 6: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1ca1) [0x563620d121f1]
 7: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x75) [0x563620e982c5]
 8: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x563620ed308c]
 9: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x563620efefb5]
 10: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x563620cf22ab]
 11: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x563620ce48a1]
 12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x563620c5bc7c]
 13: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x563620e8d906]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563620c4e92f]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56362128ef84]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563621291be4]
 17: (()+0x814a) [0x7f446112c14a]
 18: (clone()+0x43) [0x7f445fe63f23]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
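If we read the failed assertion at PGLog.cc:369 correctly, the OSD aborts while this PG is in the Stray peering state because the PG log it holds locally and the log it receives from a peer no longer share any overlapping range of versions, so PGLog::merge_log() cannot reconcile them. Below is a rough sketch of that overlap check as we understand it, with simplified, made-up types and numbers purely for illustration; it is not the actual Ceph code:

// Rough illustration of the condition behind
//   FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
// NOT Ceph source code; names and the version model are simplified.
#include <cassert>
#include <iostream>
#include <tuple>

// Simplified stand-in for Ceph's epoch'version pairs (e.g. 7010'401).
struct Eversion {
    unsigned epoch;
    unsigned version;
    bool operator>=(const Eversion& o) const {
        return std::tie(epoch, version) >= std::tie(o.epoch, o.version);
    }
};

// Simplified stand-in for a PG log: the contiguous range (tail, head].
struct PgLogRange {
    Eversion tail;
    Eversion head;
};

// merge_log() can only stitch two logs together if their ranges overlap;
// otherwise the entries cannot be reconciled and Ceph asserts instead.
bool logs_overlap(const PgLogRange& log, const PgLogRange& olog) {
    return log.head >= olog.tail && olog.head >= log.tail;
}

int main() {
    // Made-up numbers for illustration only (the local range loosely
    // echoes the (0'0,7010'401] log of pg 8.7s0 seen in the dump above).
    PgLogRange local{{0, 0}, {7010, 401}};      // this OSD's copy
    PgLogRange incoming{{8000, 0}, {9188, 79}}; // log received while peering

    std::cout << std::boolalpha << logs_overlap(local, incoming) << "\n"; // false
    // In the real OSD this condition is a ceph_assert, so a non-overlapping
    // log received in the Stray state aborts the daemon, which matches the
    // backtrace (PeeringState::Stray::react -> PGLog::merge_log).
    return 0;
}

The remaining part of the crash dump (logging levels and the thread map) follows for completeness.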
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_rwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f443e899700 / osd_srv_heartbt
  7f443f09a700 / tp_osd_tp
  7f443f89b700 / tp_osd_tp
  7f444009c700 / tp_osd_tp
  7f444089d700 / tp_osd_tp
  7f444109e700 / tp_osd_tp
  7f444a0b0700 / ms_dispatch
  7f444b0b2700 / rocksdb:dump_st
  7f444beae700 / fn_anonymous
  7f444ceb0700 / cfin
  7f444e28c700 / safe_timer
  7f444f28e700 / ms_dispatch
  7f4451eba700 / bstore_mempool
  7f44570ca700 / fn_anonymous
  7f44588cd700 / safe_timer
  7f445a141700 / safe_timer
  7f445a942700 / signal_handler
  7f445b944700 / admin_socket
  7f445c145700 / service
  7f445c946700 / msgr-worker-2
  7f445d147700 / msgr-worker-1
  7f445d948700 / msgr-worker-0
  7f44633cef00 / ceph-osd
  max_recent     10000
  max_new         1000
  log_file /var/lib/ceph/crash/2023-07-15T06:58:40.734237Z_21e01469-d6d6-4be4-b913-f9cc55a7ab22/log
--- end dump of recent events ---

We would appreciate any help pointing us toward a possible troubleshooting path :)

Thanks a lot and kind regards,