Hi,

Below are the logs from one of the failed OSDs.

Aug 11 16:55:48 bash[27152]: debug -20> 2021-08-11T11:25:47.433+0000 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, src has [
Aug 11 16:55:48 bash[27152]: debug -19> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -18> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -17> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -16> 2021-08-11T11:25:47.433+0000 7fbf32006700 5 osd.12 pg_epoch: 6697 pg[2.14b( v 6312'183564 (4460'174466,6312'18356
Aug 11 16:55:48 bash[27152]: debug -15> 2021-08-11T11:25:47.441+0000 7fbf3b819700 3 osd.12 6697 handle_osd_map epochs [6696,6697], i have 6697, src has [
Aug 11 16:55:48 bash[27152]: debug -14> 2021-08-11T11:25:47.561+0000 7fbf3a817700 2 osd.12 6697 ms_handle_refused con 0x563b53a3cc00 session 0x563b51aecb
Aug 11 16:55:48 bash[27152]: debug -13> 2021-08-11T11:25:47.561+0000 7fbf3a817700 10 monclient: _send_mon_message to mon.strg-node2 at v2: 10.0.103.2:3300/
Aug 11 16:55:48 bash[27152]: debug -12> 2021-08-11T11:25:47.565+0000 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66226000 session 0
Aug 11 16:55:48 bash[27152]: debug -11> 2021-08-11T11:25:47.581+0000 7fbf3b819700 2 osd.12 6697 ms_handle_refused con 0x563b66227c00 session 0
Aug 11 16:55:48 bash[27152]: debug -10> 2021-08-11T11:25:47.581+0000 7fbf4e0ae700 10 monclient: get_auth_request con 0x563b53a4f400 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -9> 2021-08-11T11:25:47.581+0000 7fbf39815700 2 osd.12 6697 ms_handle_refused con 0x563b53a3c800 session 0x563b679120
Aug 11 16:55:48 bash[27152]: debug -8> 2021-08-11T11:25:47.581+0000 7fbf39815700 10 monclient: _send_mon_message to mon.strg-node2 at v2: 10.0.103.2:3300/
Aug 11 16:55:48 bash[27152]: debug -7> 2021-08-11T11:25:47.581+0000 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b6331d000 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -6> 2021-08-11T11:25:47.581+0000 7fbf4e8af700 10 monclient: get_auth_request con 0x563b53a4f000 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -5> 2021-08-11T11:25:47.717+0000 7fbf4f0b0700 10 monclient: get_auth_request con 0x563b66226c00 auth_method 0
Aug 11 16:55:48 bash[27152]: debug -4> 2021-08-11T11:25:47.789+0000 7fbf43623700 5 prioritycache tune_memory target: 1073741824 mapped: 388874240 unmap
Aug 11 16:55:48 bash[27152]: debug -3> 2021-08-11T11:25:47.925+0000 7fbf32807700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_
Aug 11 16:55:48 bash[27152]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZ
Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 11 16:55:48 bash[27152]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x563b46835dbe]
Aug 11 16:55:48 bash[27152]: 2: (()+0x504fd8) [0x563b46835fd8]
Aug 11 16:55:48 bash[27152]: 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25]
Aug 11 16:55:48 bash[27152]: 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74
Aug 11 16:55:48 bash[27152]: 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df]
Aug 11 16:55:48 bash[27152]: 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563b46f6f224]
Aug 11 16:55:48 bash[27152]: 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84]
Aug 11 16:55:48 bash[27152]: 8: (()+0x82de) [0x7fbf528952de]
Aug 11 16:55:48 bash[27152]: 9: (clone()+0x43) [0x7fbf515cce83]
Aug 11 16:55:48 bash[27152]: debug -2> 2021-08-11T11:25:47.929+0000 7fbf32807700 -1 *** Caught signal (Aborted) **
Aug 11 16:55:48 bash[27152]: in thread 7fbf32807700 thread_name:tp_osd_tp
Aug 11 16:55:48 bash[27152]: ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 11 16:55:48 bash[27152]: 1: (()+0x12dd0) [0x7fbf5289fdd0]
Aug 11 16:55:48 bash[27152]: 2: (gsignal()+0x10f) [0x7fbf5150870f]
Aug 11 16:55:48 bash[27152]: 3: (abort()+0x127) [0x7fbf514f2b25]
Aug 11 16:55:48 bash[27152]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x563b46835e0f]
Aug 11 16:55:48 bash[27152]: 5: (()+0x504fd8) [0x563b46835fd8]
Aug 11 16:55:48 bash[27152]: 6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x563b46918c25]
Aug 11 16:55:48 bash[27152]: 7: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x563b46b74
Aug 11 16:55:48 bash[27152]: 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x563b469364df]
Aug 11 16:55:48 bash[27152]: 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563b46f6f224]
Aug 11 16:55:48 bash[27152]: 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563b46f71e84]
Aug 11 16:55:48 bash[27152]: 11: (()+0x82de) [0x7fbf528952de]
Aug 11 16:55:48 bash[27152]: 12: (clone()+0x43) [0x7fbf515cce83]
Aug 11 16:55:48 bash[27152]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 11 16:55:48 bash[27152]: debug -1> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: tick
Aug 11 16:55:48 bash[27152]: debug 0> 2021-08-11T11:25:48.045+0000 7fbf3f9f4700 10 monclient: _check_auth_rotating have uptodate secrets (they expire af
Aug 11 16:55:48 bash[27152]: --- logging levels ---
Aug 11 16:55:48 bash[27152]: 0/ 5 none
Aug 11 16:55:48 bash[27152]: 0/ 1 lockdep
Aug 11 16:55:48 bash[27152]: 0/ 1 context
Aug 11 16:55:48 bash[27152]: 1/ 1 crush
Aug 11 16:55:48 bash[27152]: 1/ 5 mds
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_balancer
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_locker
Aug 11 16:55:48 bash[27152]: 1/ 5 mds_log
Aug 11 16:55:48 bash[27152]: --- pthread ID / name mapping for recent threads ---
Aug 11 16:55:48 bash[27152]: 7fbf30002700 / osd_srv_heartbt
Aug 11 16:55:48 bash[27152]: 7fbf30803700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf31004700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf31805700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf32006700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf32807700 / tp_osd_tp
Aug 11 16:55:48 bash[27152]: 7fbf39815700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3a817700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3b819700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf3c81b700 / rocksdb:dump_st
Aug 11 16:55:48 bash[27152]: 7fbf3d617700 / fn_anonymous
Aug 11 16:55:48 bash[27152]: 7fbf3e619700 / cfin
Aug 11 16:55:48 bash[27152]: 7fbf3f9f4700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf409f6700 / ms_dispatch
Aug 11 16:55:48 bash[27152]: 7fbf43623700 / bstore_mempool
Aug 11 16:55:48 bash[27152]: 7fbf48833700 / fn_anonymous
Aug 11 16:55:48 bash[27152]: 7fbf4a036700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf4b8a9700 / safe_timer
Aug 11 16:55:48 bash[27152]: 7fbf4c0aa700 / signal_handler
Aug 11 16:55:48 bash[27152]: 7fbf4d0ac700 / admin_socket
Aug 11 16:55:48 bash[27152]: 7fbf4d8ad700 / service
Aug 11 16:55:48 bash[27152]: 7fbf4e0ae700 / msgr-worker-2
Aug 11 16:55:48 bash[27152]: 7fbf4e8af700 / msgr-worker-1
Aug 11 16:55:48 bash[27152]: 7fbf4f0b0700 / msgr-worker-0
Aug 11 16:55:48 bash[27152]: 7fbf54b2cf40 / ceph-osd
Aug 11 16:55:48 bash[27152]: max_recent 10000
Aug 11 16:55:48 bash[27152]: max_new 1000
Aug 11 16:55:48 bash[27152]: log_file /var/lib/ceph/crash/2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3/log
Aug 11 16:55:48 bash[27152]: --- end dump of recent events ---
Aug 11 16:55:48 bash[27152]: reraise_fatal: default handler for signal 6 didn't terminate the process?
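The log_file path above suggests the daemon also archived a crash report under /var/lib/ceph/crash. As a rough sketch (assuming the mgr crash module is enabled, and that the crash ID is simply the directory name from the log_file line above), the full report with backtrace and metadata can be pulled with:

  # list all crash reports the cluster knows about
  ceph crash ls
  # show the report for this particular crash (ID assumed from the path above)
  ceph crash info 2021-08-11T11:25:47.930411Z_a06defcc-19c6-41df-a37d-c071166cdcf3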
On Wed, Aug 11, 2021 at 5:53 PM Amudhan P <amudhan83@xxxxxxxxx> wrote:
> Hi,
> I am using Ceph version 15.2.7 in a 4-node cluster. My OSDs are continuously
> stopping, and even if I start them again they stop after some time. I
> couldn't find anything in the logs.
> I have set norecover and nobackfill; as soon as I unset norecover the OSDs
> start to fail.
>
>   cluster:
>     id:     b6437922-3edf-11eb-adc2-0cc47a5ec98a
>     health: HEALTH_ERR
>             1/6307061 objects unfound (0.000%)
>             noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
>             19 osds down
>             62477 scrub errors
>             Reduced data availability: 75 pgs inactive, 12 pgs down, 57 pgs peering, 90 pgs stale
>             Possible data damage: 1 pg recovery_unfound, 7 pgs inconsistent
>             Degraded data redundancy: 3090660/12617416 objects degraded (24.495%), 394 pgs degraded, 399 pgs undersized
>             5 pgs not deep-scrubbed in time
>             127 daemons have recently crashed
>
>   data:
>     pools:   4 pools, 833 pgs
>     objects: 6.31M objects, 23 TiB
>     usage:   47 TiB used, 244 TiB / 291 TiB avail
>     pgs:     9.004% pgs not active
>              3090660/12617416 objects degraded (24.495%)
>              315034/12617416 objects misplaced (2.497%)
>              1/6307061 objects unfound (0.000%)
>              368 active+undersized+degraded
>              299 active+clean
>              56  stale+peering
>              24  stale+active+clean
>              15  active+recovery_wait
>              12  active+undersized+remapped
>              11  active+undersized+degraded+remapped+backfill_wait
>              11  down
>              7   active+recovery_wait+degraded
>              7   active+clean+remapped
>              5   active+clean+remapped+inconsistent
>              5   stale+activating+undersized
>              4   active+recovering+degraded
>              2   stale+active+recovery_wait+degraded
>              1   active+recovery_unfound+undersized+degraded+remapped
>              1   stale+remapped+peering
>              1   stale+activating
>              1   stale+down
>              1   active+remapped+backfill_wait
>              1   active+undersized+remapped+inconsistent
>              1   active+undersized+degraded+remapped+inconsistent+backfill_wait
>
> What needs to be done to recover this?
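For completeness, these are the commands behind the flag changes mentioned above, plus a rough sketch of how the PG holding the unfound object can be located and inspected before deciding on further action. The PG id 2.14b is taken from the crash log above and is only a placeholder; the real id comes from ceph health detail:

  # flags referred to above
  ceph osd set norecover
  ceph osd set nobackfill
  ceph osd unset norecover    # unsetting this is what triggers the OSD crashes here

  # find the PG(s) with unfound objects and inspect them
  ceph health detail | grep -i unfound
  ceph pg 2.14b list_unfound    # placeholder PG id
  ceph pg 2.14b query           # placeholder PG id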