Hi,

Below are the logs from one of the failed OSDs.

Aug 11 16:55:48 bash[27152]: debug -20> 2021-08-11T11:25:47.433+0000 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, src has [
Aug 11 16:55:48 bash[27152]: debug -19> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -18> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -17> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -16> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -15> 2021-08-11T11:25:47.441+0000 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, src has [
Aug 11 16:55:48 bash[27152]: debug -14> 2021-08-11T11:25:47.561+0000 7fbf3a817700 2 osd.12 6697 ms_handle_refused con 0x563b53a3cc00 session 0x563b51aecb
Aug 11 16:55:48 bash[27152]: debug -13> 2021-08-11T11:25:47.561+0000 7fbf3a817700 10 monclient: _send_mon_message to mon.strg-node2 at v2: 10.0.103.2:3300/
Aug 11 16:55:48 bash[27152]: debug -12> 2021-08-11T11:25:47.565+0000 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66226000 session 0
Aug 11 16:55:48 bash[27152]: debug -11> 2021-08-11T11:25:47.581+0000 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66227c00 session 0
Aug 11 16:55:48 bash[27152]: debug -10> 2021-08-11T11:25:47.581+0000 7fbf4e0ae700 10 monclient: get_auth_request con 0x563b53a4f400 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -9> 2021-08-11T11:25:47.581+0000 7fbf39815700 2 osd.12 6697 ms_handle_refused con 0x563b53a3c800 session 0x563b679120
Aug 11 16:55:48 bash[27152]: debug -8> 2021-08-11T11:25:47.581+0000 7fbf39815700 10 monclient: _send_mon_message to mon.strg-node2 at v2: 10.0.103.2:3300/
Aug 11 16:55:48 bash[27152]: debug -7> 2021-08-11T11:25:47.581+0000 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b6331d000 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -6> 2021-08-11T11:25:47.581+0000 7fbf4e8af700 10 monclient: get_auth_request con 0x563b53a4f000 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -5> 2021-08-11T11:25:47.717+0000 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b66226c00 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -4> 2021-08-11T11:25:47.789+0000 7fbf43623700 5 prioritycache tune_memory target: 1073741824 mapped: 388874240 unmap
Aug 11 16:55:48 bash[27152]: debug -3> 2021-08-11T11:25:47.925+0000 7fbf32807700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_
Aug 11 16:55:48 bash[27152]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZ
Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 11 16:55:48 bash[27152]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x563b46835dbe]
Aug 11 16:55:48 bash[27152]: 2: (()+0x504fd8) [0x563b46835fd8]
Aug 11 16:55:48 bash[27152]: 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25]
Aug 11 16:55:48 bash[27152]: 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74
Aug 11 16:55:48 bash[27152]: 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df]
Aug 11 16:55:48 bash[27152]: 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563b46f6f224]
Aug 11 16:55:48 bash[27152]: 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84]
Aug 11 16:55:48 bash[27152]: 8: (()+0x82de) [0x7fbf528952de]
Aug 11 16:55:48 bash[27152]: 9: (clone()+0x43) [0x7fbf515cce83]
Aug 11 16:55:48 bash[27152]: debug -2> 2021-08-11T11:25:47.929+0000 7fbf32807700 -1 *** Caught signal (Aborted) **
Aug 11 16:55:48 bash[27152]: in thread 7fbf32807700 thread_name:tp_osd_tp
Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 11 16:55:48 bash[27152]: 1: (()+0x12dd0) [0x7fbf5289fdd0]
Aug 11 16:55:48 bash[27152]: 2: (gsignal()+0x10f) [0x7fbf5150870f]
Aug 11 16:55:48 bash[27152]: 3: (abort()+0x127) [0x7fbf514f2b25]
Aug 11 16:55:48 bash[27152]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x563b46835e0f]
Aug 11 16:55:48 bash[27152]: 5: (()+0x504fd8) [0x563b46835fd8]
Aug 11 16:55:48 bash[27152]: 6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25]
Aug 11 16:55:48 bash[27152]: 7: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74
Aug 11 16:55:48 bash[27152]: 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df]
Aug 11 16:55:48 bash[27152]: 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563b46f6f224]
Aug 11 16:55:48 bash[27152]: 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84]
Aug 11 16:55:48 bash[27152]: 11: (()+0x82de) [0x7fbf528952de]
Aug 11 16:55:48 bash[27152]: 12: (clone()+0x43) [0x7fbf515cce83]
Aug 11 16:55:48 bash[27152]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 11 16:55:48 bash[27152]: debug -1> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: tick
Aug 11 16:55:48 bash[27152]: debug 0> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: _check_auth_rotating have uptodate secrets (they expire af
Aug 11 16:55:48 bash[27152]: --- logging levels ---
Aug 11 16:55:48 bash[27152]: 0/ 5 none
Aug 11 16:55:48 bash[27152]: 0/ 1 lockdep
Aug 11 16:55:48 bash[27152]: 0/ 1 context
Aug 11 16:55:48 bash[27152]: 1/ 1 crush
Aug 11 16:55:48 bash[27152]: 1/ 5 mds
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_balancer
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_locker
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_log
Aug 11 16:55:48 bash[27152]: --- pthread ID / name mapping for recent threads ---
Aug 11 16:55:48 bash[27152]: 7fbf30002700 / osd_srv_heartbt
Aug 11 16:55:48 bash[27152]: 7fbf30803700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf31004700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf31805700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf32006700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf32807700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf39815700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3a817700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3b819700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3c81b700 / rocksdb:dump_st
Aug 11 16:55:48 bash[27152]: 7fbf3d617700 / fn_anonymous
Aug 11 16:55:48 bash[27152]: 7fbf3e619700 / cfin
Aug 11 16:55:48 bash[27152]: 7fbf3f9f4700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf409f6700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf43623700 / bstore_mempool
Aug 11 16:55:48 bash[27152]: 7fbf48833700 / fn_anonymous
Aug 11 16:55:48 bash[27152]: 7fbf4a036700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf4b8a9700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf4c0aa700 / signal_handler
Aug 11 16:55:48 bash[27152]: 7fbf4d0ac700 / admin_socket
Aug 11 16:55:48 bash[27152]: 7fbf4d8ad700 / service
Aug 11 16:55:48 bash[27152]: 7fbf4e0ae700 / msgr-worker-2
Aug 11 16:55:48 bash[27152]: 7fbf4e8af700 / msgr-worker-1
Aug 11 16:55:48 bash[27152]: 7fbf4f0b0700 / msgr-worker-0
Aug 11 16:55:48 bash[27152]: 7fbf54b2cf40 / ceph-osd
Aug 11 16:55:48 bash[27152]: max_recent 10000
Aug 11 16:55:48 bash[27152]: max_new 1000
Aug 11 16:55:48 bash[27152]: log_file /var/lib/ceph/crash/2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3/log
Aug 11 16:55:48 bash[27152]: --- end dump of recent events ---
Aug 11 16:55:48 bash[27152]: reraise_fatal: default handler for signal 6 didn't terminate the process?
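The log_file path above suggests the daemon also archived a crash report under /var/lib/ceph/crash. As a rough sketch (assuming the mgr crash module is enabled, and that the crash ID is simply the directory name from the log_file line above), the full report with backtrace and metadata can be pulled with:

  # list all crash reports the cluster knows about
  ceph crash ls
  # show the report for this particular crash (ID assumed from the path above)
  ceph crash info 2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3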
On Wed, Aug 11, 2021 at 5:53 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
> Hi,
> I am using Ceph version 15.2.7 in a 4-node cluster. My OSDs are continuously
> stopping, and even if I start them again they stop after some time. I
> couldn't find anything in the logs.
> I have set norecover and nobackfill; as soon as I unset norecover the OSDs
> start to fail.
>
>   cluster:
>     id:     b6437922-3edf-11eb-adc2-0cc47a5ec98a
>     health: HEALTH_ERR
>             1/6307061 objects unfound (0.000%)
>             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
>             19 osds down
>             62477 scrub errors
>             Reduced data availability: 75 pgs inactive, 12 pgs down, 57 pgs peering, 90 pgs stale
>             Possible data damage: 1 pg recovery_unfound, 7 pgs inconsistent
>             Degraded data redundancy: 3090660/12617416 objects degraded (24.495%), 394 pgs degraded, 399 pgs undersized
>             5 pgs not deep-scrubbed in time
>             127 daemons have recently crashed
>
>   data:
>     pools:   4 pools, 833 pgs
>     objects: 6.31M objects, 23 TiB
>     usage:   47 TiB used, 244 TiB / 291 TiB avail
>     pgs:     9.004% pgs not active
>              3090660/12617416 objects degraded (24.495%)
>              315034/12617416 objects misplaced (2.497%)
>              1/6307061 objects unfound (0.000%)
>              368 active+undersized+degraded
>              299 active+clean
>              56  stale+peering
>              24  stale+active+clean
>              15  active+recovery_wait
>              12  active+undersized+remapped
>              11  active+undersized+degraded+remapped+backfill_wait
>              11  down
>              7   active+recovery_wait+degraded
>              7   active+clean+remapped
>              5   active+clean+remapped+inconsistent
>              5   stale+activating+undersized
>              4   active+recovering+degraded
>              2   stale+active+recovery_wait+degraded
>              1   active+recovery_unfound+undersized+degraded+remapped
>              1   stale+remapped+peering
>              1   stale+activating
>              1   stale+down
>              1   active+remapped+backfill_wait
>              1   active+undersized+remapped+inconsistent
>              1   active+undersized+degraded+remapped+inconsistent+backfill_wait
>
> What needs to be done to recover this?
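For completeness, these are the commands behind the flag changes mentioned above, plus a rough sketch of how the PG holding the unfound object can be located and inspected before deciding on further action. The PG id 2.14b is taken from the crash log above and is only a placeholder; the real id comes from ceph health detail:

  # flags referred to above
  ceph osd set norecover
  ceph osd set nobackfill
  ceph osd unset norecover    # unsetting this is what triggers the OSD crashes here

  # find the PG(s) with unfound objects and inspect them
  ceph health detail | grep -i unfound
  ceph pg 2.14b list_unfound    # placeholder PG id
  ceph pg 2.14b query           # placeholder PG id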