Can you send the results of "ceph daemon osd.0 status" and maybe do that for a couple of osd ids ? You may need to target ones which are currently running. Respectfully, *Wes Dillingham* wes@xxxxxxxxxxxxxxxxx LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Wed, Aug 11, 2021 at 9:51 AM Amudhan P <amudhan83@xxxxxxxxx> wrote: > Hi, > > Below are the logs in one of the failed OSD. > > Aug 11 16:55:48 bash[27152]: debug -20> 2021-08-11T11:25:47.433+0000 > 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, > src has [ > Aug 11 16:55:48 bash[27152]: debug -19> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -18> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -17> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -16> 2021-08-11T11:25:47.433+0000 > 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 > (4460'174466,6312'18356 > Aug 11 16:55:48 bash[27152]: debug -15> 2021-08-11T11:25:47.441+0000 > 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, > src has [ > Aug 11 16:55:48 bash[27152]: debug -14> 2021-08-11T11:25:47.561+0000 > 7fbf3a817700 2 osd.12 6697 ms_handle_refused con 0x563b53a3cc00 session > 0x563b51aecb > Aug 11 16:55:48 bash[27152]: debug -13> 2021-08-11T11:25:47.561+0000 > 7fbf3a817700 10 monclient: _send_mon_message to mon.strg-node2 at v2: > 10.0.103.2:3300/ > Aug 11 16:55:48 bash[27152]: debug -12> 2021-08-11T11:25:47.565+0000 > 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66226000 session 0 > Aug 11 16:55:48 bash[27152]: debug -11> 2021-08-11T11:25:47.581+0000 > 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66227c00 session 0 > Aug 11 16:55:48 bash[27152]: debug -10> 2021-08-11T11:25:47.581+0000 > 7fbf4e0ae700 10 monclient: get_auth_request con 0x563b53a4f400 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -9> 2021-08-11T11:25:47.581+0000 > 7fbf39815700 2 osd.12 6697 ms_handle_refused con 0x563b53a3c800 session > 0x563b679120 > Aug 11 16:55:48 bash[27152]: debug -8> 2021-08-11T11:25:47.581+0000 > 7fbf39815700 10 monclient: _send_mon_message to mon.strg-node2 at v2: > 10.0.103.2:3300/ > Aug 11 16:55:48 bash[27152]: debug -7> 2021-08-11T11:25:47.581+0000 > 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b6331d000 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -6> 2021-08-11T11:25:47.581+0000 > 7fbf4e8af700 10 monclient: get_auth_request con 0x563b53a4f000 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -5> 2021-08-11T11:25:47.717+0000 > 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b66226c00 auth_method > 0 > Aug 11 16:55:48 bash[27152]: debug -4> 2021-08-11T11:25:47.789+0000 > 7fbf43623700 5 prioritycache tune_memory target: 1073741824 mapped: > 388874240 unmap > Aug 11 16:55:48 bash[27152]: debug -3> 2021-08-11T11:25:47.925+0000 > 7fbf32807700 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ > Aug 11 16:55:48 bash[27152]: > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZ > Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 > (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable) > Aug 11 16:55:48 bash[27152]: 1: (ceph::__ceph_assert_fail(char 
const*, > char const*, int, char const*)+0x158) [0x563b46835dbe] > Aug 11 16:55:48 bash[27152]: 2: (()+0x504fd8) [0x563b46835fd8] > Aug 11 16:55:48 bash[27152]: 3: (OSD::do_recovery(PG*, unsigned int, > unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25] > Aug 11 16:55:48 bash[27152]: 4: > (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74 > Aug 11 16:55:48 bash[27152]: 5: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df] > Aug 11 16:55:48 bash[27152]: 6: > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) > [0x563b46f6f224] > Aug 11 16:55:48 bash[27152]: 7: > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84] > Aug 11 16:55:48 bash[27152]: 8: (()+0x82de) [0x7fbf528952de] > Aug 11 16:55:48 bash[27152]: 9: (clone()+0x43) [0x7fbf515cce83] > Aug 11 16:55:48 bash[27152]: debug -2> 2021-08-11T11:25:47.929+0000 > 7fbf32807700 -1 *** Caught signal (Aborted) ** > Aug 11 16:55:48 bash[27152]: in thread 7fbf32807700 thread_name:tp_osd_tp > Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 > (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable) > Aug 11 16:55:48 bash[27152]: 1: (()+0x12dd0) [0x7fbf5289fdd0] > Aug 11 16:55:48 bash[27152]: 2: (gsignal()+0x10f) [0x7fbf5150870f] > Aug 11 16:55:48 bash[27152]: 3: (abort()+0x127) [0x7fbf514f2b25] > Aug 11 16:55:48 bash[27152]: 4: (ceph::__ceph_assert_fail(char const*, > char const*, int, char const*)+0x1a9) [0x563b46835e0f] > Aug 11 16:55:48 bash[27152]: 5: (()+0x504fd8) [0x563b46835fd8] > Aug 11 16:55:48 bash[27152]: 6: (OSD::do_recovery(PG*, unsigned int, > unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25] > Aug 11 16:55:48 bash[27152]: 7: > (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74 > Aug 11 16:55:48 bash[27152]: 8: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df] > Aug 11 16:55:48 bash[27152]: 9: > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) > [0x563b46f6f224] > Aug 11 16:55:48 bash[27152]: 10: > (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84] > Aug 11 16:55:48 bash[27152]: 11: (()+0x82de) [0x7fbf528952de] > Aug 11 16:55:48 bash[27152]: 12: (clone()+0x43) [0x7fbf515cce83] > Aug 11 16:55:48 bash[27152]: NOTE: a copy of the executable, or `objdump > -rdS <executable>` is needed to interpret this. 
> Aug 11 16:55:48 bash[27152]: debug  -1> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: tick
> Aug 11 16:55:48 bash[27152]: debug   0> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: _check_auth_rotating have uptodate secrets (they expire af
> Aug 11 16:55:48 bash[27152]: --- logging levels ---
> Aug 11 16:55:48 bash[27152]: 0/ 5 none
> Aug 11 16:55:48 bash[27152]: 0/ 1 lockdep
> Aug 11 16:55:48 bash[27152]: 0/ 1 context
> Aug 11 16:55:48 bash[27152]: 1/ 1 crush
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_balancer
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_locker
> Aug 11 16:55:48 bash[27152]: 1/ 5 mds_log
> Aug 11 16:55:48 bash[27152]: --- pthread ID / name mapping for recent threads ---
> Aug 11 16:55:48 bash[27152]: 7fbf30002700 / osd_srv_heartbt
> Aug 11 16:55:48 bash[27152]: 7fbf30803700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf31004700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf31805700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf32006700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf32807700 / tp_osd_tp
> Aug 11 16:55:48 bash[27152]: 7fbf39815700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3a817700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3b819700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf3c81b700 / rocksdb:dump_st
> Aug 11 16:55:48 bash[27152]: 7fbf3d617700 / fn_anonymous
> Aug 11 16:55:48 bash[27152]: 7fbf3e619700 / cfin
> Aug 11 16:55:48 bash[27152]: 7fbf3f9f4700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf409f6700 / ms_dispatch
> Aug 11 16:55:48 bash[27152]: 7fbf43623700 / bstore_mempool
> Aug 11 16:55:48 bash[27152]: 7fbf48833700 / fn_anonymous
> Aug 11 16:55:48 bash[27152]: 7fbf4a036700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf4b8a9700 / safe_timer
> Aug 11 16:55:48 bash[27152]: 7fbf4c0aa700 / signal_handler
> Aug 11 16:55:48 bash[27152]: 7fbf4d0ac700 / admin_socket
> Aug 11 16:55:48 bash[27152]: 7fbf4d8ad700 / service
> Aug 11 16:55:48 bash[27152]: 7fbf4e0ae700 / msgr-worker-2
> Aug 11 16:55:48 bash[27152]: 7fbf4e8af700 / msgr-worker-1
> Aug 11 16:55:48 bash[27152]: 7fbf4f0b0700 / msgr-worker-0
> Aug 11 16:55:48 bash[27152]: 7fbf54b2cf40 / ceph-osd
> Aug 11 16:55:48 bash[27152]: max_recent 10000
> Aug 11 16:55:48 bash[27152]: max_new 1000
> Aug 11 16:55:48 bash[27152]: log_file /var/lib/ceph/crash/2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3/log
> Aug 11 16:55:48 bash[27152]: --- end dump of recent events ---
> Aug 11 16:55:48 bash[27152]: reraise_fatal: default handler for signal 6 didn't terminate the process?
>
> On Wed, Aug 11, 2021 at 5:53 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>
> > Hi,
> > I am using ceph version 15.2.7 in a 4-node cluster. My OSDs keep stopping, and even if I start them again they stop after some time. I couldn't find anything in the logs.
> > I have set norecover and nobackfill; as soon as I unset norecover, the OSDs start to fail.
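
For reference, the flags being toggled in the message above are the cluster-wide OSD flags. A minimal sketch of the commands involved (run from any node with an admin keyring; the flag names come from the status output below, and the exact sequence shown is only illustrative, not the poster's literal history):

    # pause recovery and backfill cluster-wide while investigating
    ceph osd set norecover
    ceph osd set nobackfill

    # re-enable them later; unsetting norecover is the step after which the OSDs crash here
    ceph osd unset norecover
    ceph osd unset nobackfill
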
> >
> >   cluster:
> >     id:     b6437922-3edf-11eb-adc2-0cc47a5ec98a
> >     health: HEALTH_ERR
> >             1/6307061 objects unfound (0.000%)
> >             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
> >             19 osds down
> >             62477 scrub errors
> >             Reduced data availability: 75 pgs inactive, 12 pgs down, 57 pgs peering, 90 pgs stale
> >             Possible data damage: 1 pg recovery_unfound, 7 pgs inconsistent
> >             Degraded data redundancy: 3090660/12617416 objects degraded (24.495%), 394 pgs degraded, 399 pgs undersized
> >             5 pgs not deep-scrubbed in time
> >             127 daemons have recently crashed
> >
> >   data:
> >     pools:   4 pools, 833 pgs
> >     objects: 6.31M objects, 23 TiB
> >     usage:   47 TiB used, 244 TiB / 291 TiB avail
> >     pgs:     9.004% pgs not active
> >              3090660/12617416 objects degraded (24.495%)
> >              315034/12617416 objects misplaced (2.497%)
> >              1/6307061 objects unfound (0.000%)
> >              368 active+undersized+degraded
> >              299 active+clean
> >              56  stale+peering
> >              24  stale+active+clean
> >              15  active+recovery_wait
> >              12  active+undersized+remapped
> >              11  active+undersized+degraded+remapped+backfill_wait
> >              11  down
> >              7   active+recovery_wait+degraded
> >              7   active+clean+remapped
> >              5   active+clean+remapped+inconsistent
> >              5   stale+activating+undersized
> >              4   active+recovering+degraded
> >              2   stale+active+recovery_wait+degraded
> >              1   active+recovery_unfound+undersized+degraded+remapped
> >              1   stale+remapped+peering
> >              1   stale+activating
> >              1   stale+down
> >              1   active+remapped+backfill_wait
> >              1   active+undersized+remapped+inconsistent
> >              1   active+undersized+degraded+remapped+inconsistent+backfill_wait
> >
> > what needs to be done to recover this?
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
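
For completeness, a minimal sketch of one way to collect the per-OSD status Wes asks for at the top of the thread. It assumes each command is run on the host that carries the OSD in question and, for containerized (cephadm) deployments, from inside that daemon's container (for example after "cephadm enter --name osd.0"); the OSD IDs used are examples only:

    # query the admin socket of a few OSDs that are still running
    for id in 0 12; do
        echo "--- osd.$id ---"
        ceph daemon osd.$id status
    done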