Can you send the results of "ceph daemon osd.0 status" and maybe do that for a couple of osd ids ? You may need to target ones which are currently running. Respectfully, *Wes Dillingham* wes@xxxxxxxxxxxxxxxxx LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Wed, Aug 11, 2021 at 9:51 AM Amudhan P <amudhan83@xxxxxxxxx> wrote: > Hi, > > Below are the logs in one of the failed OSD. > > Aug 11 16:55:48 bash[27152]: debug -20> 2021-08-11T11:25:47.433+0000 > 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, > src has [ > Aug 11 16:55:48 bash[27152]: debug -19> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -18> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -17> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -16> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -15> 2021-08-11T11:25:47.441+0000 > 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, > src has [ > Aug 11 16:55:48 bash[27152]: debug -14> 2021-08-11T11:25:47.561+0000 > 7fbf3a817700 2 osd.12 6697 ms_handle_refused con 0x563b53a3cc00 session > 0x563b51aecb > Aug 11 16:55:48 bash[27152]: debug -13> 2021-08-11T11:25:47.561+0000 > 7fbf3a817700 10 monclient: _send_mon_message to mon.strg-node2 at v2: > 10.0.103.2:3300/ > Aug 11 16:55:48 bash[27152]: debug -12> 2021-08-11T11:25:47.565+0000 > 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66226000 session 0 > Aug 11 16:55:48 bash[27152]: debug -11> 2021-08-11T11:25:47.581+0000 > 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66227c00 session 0 > Aug 11 16:55:48 bash[27152]: debug -10> 2021-08-11T11:25:47.581+0000 > 7fbf4e0ae700 10 monclient: get_auth_request con 0x563b53a4f400 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -9> 2021-08-11T11:25:47.581+0000 > 7fbf39815700 2 osd.12 6697 ms_handle_refused con 0x563b53a3c800 session > 0x563b679120 > Aug 11 16:55:48 bash[27152]: debug -8> 2021-08-11T11:25:47.581+0000 > 7fbf39815700 10 monclient: _send_mon_message to mon.strg-node2 at v2: > 10.0.103.2:3300/ > Aug 11 16:55:48 bash[27152]: debug -7> 2021-08-11T11:25:47.581+0000 > 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b6331d000 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -6> 2021-08-11T11:25:47.581+0000 > 7fbf4e8af700 10 monclient: get_auth_request con 0x563b53a4f000 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -5> 2021-08-11T11:25:47.717+0000 > 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b66226c00 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -4> 2021-08-11T11:25:47.789+0000 > 7fbf43623700 5 prioritycache tune_memory target: 1073741824 mapped: > 388874240 unmap > Aug 11 16:55:48 bash[27152]: debug -3> 2021-08-11T11:25:47.925+0000 > 7fbf32807700 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ > Aug 11 16:55:48 bash[27152]: > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZ > Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 > (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable) > Aug 11 16:55:48 bash[27152]: 1: (ceph::__ceph_assert_fail(char 
const*, > char const*, int, char const*)+0x158) [0x563b46835dbe] > Aug 11 16:55:48 bash[27152]: 2: (()+0x504fd8) [0x563b46835fd8] > Aug 11 16:55:48 bash[27152]: 3: (OSD::do_recovery(PG*, unsigned int, > unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25] > Aug 11 16:55:48 bash[27152]: 4: > (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74 > Aug 11 16:55:48 bash[27152]: 5: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df] > Aug 11 16:55:48 bash[27152]: 6: > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) > [0x563b46f6f224] > Aug 11 16:55:48 bash[27152]: 7: > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84] > Aug 11 16:55:48 bash[27152]: 8: (()+0x82de) [0x7fbf528952de] > Aug 11 16:55:48 bash[27152]: 9: (clone()+0x43) [0x7fbf515cce83] > Aug 11 16:55:48 bash[27152]: debug -2> 2021-08-11T11:25:47.929+0000 > 7fbf32807700 -1 *** Caught signal (Aborted) ** > Aug 11 16:55:48 bash[27152]: in thread 7fbf32807700 thread_name:tp_osd_tp > Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 > (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable) > Aug 11 16:55:48 bash[27152]: 1: (()+0x12dd0) [0x7fbf5289fdd0] > Aug 11 16:55:48 bash[27152]: 2: (gsignal()+0x10f) [0x7fbf5150870f] > Aug 11 16:55:48 bash[27152]: 3: (abort()+0x127) [0x7fbf514f2b25] > Aug 11 16:55:48 bash[27152]: 4: (ceph::__ceph_assert_fail(char const*, > char const*, int, char const*)+0x1a9) [0x563b46835e0f] > Aug 11 16:55:48 bash[27152]: 5: (()+0x504fd8) [0x563b46835fd8] > Aug 11 16:55:48 bash[27152]: 6: (OSD::do_recovery(PG*, unsigned int, > unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25] > Aug 11 16:55:48 bash[27152]: 7: > (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74 > Aug 11 16:55:48 bash[27152]: 8: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df] > Aug 11 16:55:48 bash[27152]: 9: > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) > [0x563b46f6f224] > Aug 11 16:55:48 bash[27152]: 10: > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84] > Aug 11 16:55:48 bash[27152]: 11: (()+0x82de) [0x7fbf528952de] > Aug 11 16:55:48 bash[27152]: 12: (clone()+0x43) [0x7fbf515cce83] > Aug 11 16:55:48 bash[27152]: NOTE: a copy of the executable, or `objdump > -rdS <executable>` is needed to interpret this. 
> Aug 11 16:55:48 bash[27152]: debug  -1> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: tick
> Aug 11 16:55:48 bash[27152]: debug   0> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: _check_auth_rotating have uptodate secrets (they expire af
> Aug 11 16:55:48 bash[27152]: --- logging levels ---
> Aug 11 16:55:48 bash[27152]: 0/ 5 none
> Aug 11 16:55:48 bash[27152]: 0/ 1 lockdep
> Aug 11 16:55:48 bash[27152]: 0/ 1 context
> Aug 11 16:55:48 bash[27152]: 1/ 1 crush
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_balancer
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_locker
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_log
> Aug 11 16:55:48 bash[27152]: --- pthread ID / name mapping for recent threads ---
> Aug 11 16:55:48 bash[27152]: 7fbf30002700 / osd_srv_heartbt
> Aug 11 16:55:48 bash[27152]: 7fbf30803700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf31004700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf31805700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf32006700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf32807700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf39815700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3a817700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3b819700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3c81b700 / rocksdb:dump_st
> Aug 11 16:55:48 bash[27152]: 7fbf3d617700 / fn_anonymous
> Aug 11 16:55:48 bash[27152]: 7fbf3e619700 / cfin
> Aug 11 16:55:48 bash[27152]: 7fbf3f9f4700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf409f6700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf43623700 / bstore_mempool
> Aug 11 16:55:48 bash[27152]: 7fbf48833700 / fn_anonymous
> Aug 11 16:55:48 bash[27152]: 7fbf4a036700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf4b8a9700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf4c0aa700 / signal_handler
> Aug 11 16:55:48 bash[27152]: 7fbf4d0ac700 / admin_socket
> Aug 11 16:55:48 bash[27152]: 7fbf4d8ad700 / service
> Aug 11 16:55:48 bash[27152]: 7fbf4e0ae700 / msgr-worker-2
> Aug 11 16:55:48 bash[27152]: 7fbf4e8af700 / msgr-worker-1
> Aug 11 16:55:48 bash[27152]: 7fbf4f0b0700 / msgr-worker-0
> Aug 11 16:55:48 bash[27152]: 7fbf54b2cf40 / ceph-osd
> Aug 11 16:55:48 bash[27152]: max_recent 10000
> Aug 11 16:55:48 bash[27152]: max_new 1000
> Aug 11 16:55:48 bash[27152]: log_file /var/lib/ceph/crash/2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3/log
> Aug 11 16:55:48 bash[27152]: --- end dump of recent events ---
> Aug 11 16:55:48 bash[27152]: reraise_fatal: default handler for signal 6 didn't terminate the process?
>
> On Wed, Aug 11, 2021 at 5:53 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>
> > Hi,
> > I am using ceph version 15.2.7 in a 4-node cluster. My OSDs keep stopping, and even if I start them again they stop after some time. I couldn't find anything in the logs.
> > I have set norecover and nobackfill; as soon as I unset norecover, the OSDs start to fail.
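
For reference, the flags being toggled in the message above are the cluster-wide OSD flags. A minimal sketch of the commands involved (run from any node with an admin keyring; the flag names come from the status output below, and the exact sequence shown is only illustrative, not the poster's literal history):

    # pause recovery and backfill cluster-wide while investigating
    ceph osd set norecover
    ceph osd set nobackfill

    # re-enable them later; unsetting norecover is the step after which the OSDs crash here
    ceph osd unset norecover
    ceph osd unset nobackfill
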
> >
> >   cluster:
> >     id:     b6437922-3edf-11eb-adc2-0cc47a5ec98a
> >     health: HEALTH_ERR
> >             1/6307061 objects unfound (0.000%)
> >             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
> >             19 osds down
> >             62477 scrub errors
> >             Reduced data availability: 75 pgs inactive, 12 pgs down, 57 pgs peering, 90 pgs stale
> >             Possible data damage: 1 pg recovery_unfound, 7 pgs inconsistent
> >             Degraded data redundancy: 3090660/12617416 objects degraded (24.495%), 394 pgs degraded, 399 pgs undersized
> >             5 pgs not deep-scrubbed in time
> >             127 daemons have recently crashed
> >
> >   data:
> >     pools:   4 pools, 833 pgs
> >     objects: 6.31M objects, 23 TiB
> >     usage:   47 TiB used, 244 TiB / 291 TiB avail
> >     pgs:     9.004% pgs not active
> >              3090660/12617416 objects degraded (24.495%)
> >              315034/12617416 objects misplaced (2.497%)
> >              1/6307061 objects unfound (0.000%)
> >              368 active+undersized+degraded
> >              299 active+clean
> >              56  stale+peering
> >              24  stale+active+clean
> >              15  active+recovery_wait
> >              12  active+undersized+remapped
> >              11  active+undersized+degraded+remapped+backfill_wait
> >              11  down
> >              7   active+recovery_wait+degraded
> >              7   active+clean+remapped
> >              5   active+clean+remapped+inconsistent
> >              5   stale+activating+undersized
> >              4   active+recovering+degraded
> >              2   stale+active+recovery_wait+degraded
> >              1   active+recovery_unfound+undersized+degraded+remapped
> >              1   stale+remapped+peering
> >              1   stale+activating
> >              1   stale+down
> >              1   active+remapped+backfill_wait
> >              1   active+undersized+remapped+inconsistent
> >              1   active+undersized+degraded+remapped+inconsistent+backfill_wait
> >
> > what needs to be done to recover this?
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
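
For completeness, a minimal sketch of one way to collect the per-OSD status Wes asks for at the top of the thread. It assumes each command is run on the host that carries the OSD in question and, for containerized (cephadm) deployments, from inside that daemon's container (for example after "cephadm enter --name osd.0"); the OSD IDs used are examples only:

    # query the admin socket of a few OSDs that are still running
    for id in 0 12; do
        echo "--- osd.$id ---"
        ceph daemon osd.$id status
    done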