Re: Recovery stuck and Multiple PG fails


 



Suresh,

The problem is that some of my OSD services are not stable; they crash
continuously.

I have attached OSD log lines from the time of the failure; debug logging
was already enabled.

Let me know if you need more details.
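
For reference, debug logging was raised and the crash metadata collected
roughly as follows (a sketch; osd.7 is the daemon from the attached log,
adjust the ID for other OSDs):

    # raise debug logging on the crashing OSD (persists until reset)
    ceph config set osd.7 debug_osd 10/10
    ceph config set osd.7 debug_ms 1

    # list recorded crashes and show details of the newest one
    ceph crash ls
    ceph crash info <crash-id>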

On Sat, Aug 14, 2021 at 8:10 PM Suresh Rama <sstkadu@xxxxxxxxx> wrote:

> Amudhan,
>
> Have you looked at the logs and did you try enabling debug to see why the
> OSDs are marked down? There should be some reason right? Just focus on the
> MON and take one node/OSD by enabling debug to see what is happening.
> https://docs.ceph.com/en/latest/cephadm/operations/.
>
> Thanks,
> Suresh
>
> On Sat, Aug 14, 2021, 9:53 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>
>> Hi,
>> I am stuck with a Ceph cluster that has multiple PG errors because several
>> OSDs stopped, and starting the OSDs manually again didn't help; the OSD
>> services stop again. There is no issue with the HDDs for sure, but for some
>> reason the OSDs keep stopping.
>>
>> I am running Ceph version 15.2.5 in a podman container.
>>
>> How do I recover these pg failures?
>>
>> Can someone help me recover this, or point me to where to look further?
>>
>>     pgs:     0.360% pgs not active
>>              124186/5082364 objects degraded (2.443%)
>>              29899/5082364 objects misplaced (0.588%)
>>              670 active+clean
>>              69  active+undersized+remapped
>>              26  active+undersized+degraded+remapped+backfill_wait
>>              16  active+undersized+remapped+backfill_wait
>>              15  active+undersized+degraded+remapped
>>              13  active+clean+remapped
>>              9   active+recovery_wait+degraded
>>              4   active+remapped+backfill_wait
>>              3   stale+down
>>              3   active+undersized+remapped+inconsistent
>>              2   active+recovery_wait+degraded+remapped
>>              1   active+recovering+degraded+remapped
>>              1   active+clean+remapped+inconsistent
>>              1   active+recovering+degraded
>>
>
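
If it helps to narrow down which PGs are stuck, these read-only commands can
be run against the cluster (a sketch; 2.cd is the PG from the attached OSD
log, substitute the affected PG IDs):

    # summarize PGs stuck in problem states
    ceph pg dump_stuck unclean
    ceph pg dump_stuck stale

    # full peering/recovery detail for a single PG
    ceph pg 2.cd query

    # list, and only if appropriate, repair inconsistent PGs
    rados list-inconsistent-obj 2.cd --format=json-pretty
    ceph pg repair 2.cd
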
Aug 14 20:25:32 node1 bash[29321]: debug    -16> 2021-08-14T14:55:32.139+0000 7f2097869700 10 monclient: handle_auth_request added challenge on 0x5564eccdb400
Aug 14 20:25:32 node1 bash[29321]: debug    -15> 2021-08-14T14:55:32.139+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7171/7172 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] exit Started/Primary/Peering/GetInfo 0.486478 5 0.000268
Aug 14 20:25:32 node1 bash[29321]: debug    -14> 2021-08-14T14:55:32.139+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7171/7172 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] enter Started/Primary/Peering/GetLog
Aug 14 20:25:32 node1 bash[29321]: debug    -13> 2021-08-14T14:55:32.139+0000 7f2083fd2700  3 osd.7 7180 handle_osd_map epochs [7180,7180], i have 7180, src has [5697,7180]
Aug 14 20:25:32 node1 bash[29321]: debug    -12> 2021-08-14T14:55:32.143+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7176/7177 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] backfill=[34] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] exit Started/Primary/Peering/GetLog 0.004066 2 0.000112
Aug 14 20:25:32 node1 bash[29321]: debug    -11> 2021-08-14T14:55:32.143+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7176/7177 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] backfill=[34] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] enter Started/Primary/Peering/GetMissing
Aug 14 20:25:32 node1 bash[29321]: debug    -10> 2021-08-14T14:55:32.143+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7176/7177 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] backfill=[34] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] exit Started/Primary/Peering/GetMissing 0.000057 0 0.000000
Aug 14 20:25:32 node1 bash[29321]: debug     -9> 2021-08-14T14:55:32.143+0000 7f207afc0700  5 osd.7 pg_epoch: 7180 pg[2.cd( v 6838'194480 (6838'187296,6838'194480] local-lis/les=7176/7177 n=5007 ec=226/226 lis/c=7176/6927 les/c/f=7177/6928/0 sis=7180) [7,34]/[7,47] backfill=[34] r=0 lpr=7180 pi=[6927,7180)/1 crt=6838'194480 lcod 0'0 mlcod 0'0 remapped+peering mbc={}] enter Started/Primary/Peering/WaitUpThru
Aug 14 20:25:32 node1 bash[29321]: debug     -8> 2021-08-14T14:55:32.147+0000 7f2083fd2700  3 osd.7 7180 handle_osd_map epochs [7180,7180], i have 7180, src has [5697,7180]
Aug 14 20:25:32 node1 bash[29321]: debug     -7> 2021-08-14T14:55:32.203+0000 7f207a7bf700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/osd/OSD.cc: In function 'void OSD::do_recovery(PG*, epoch_t, uint64_t, ThreadPool::TPHandle&)' thread 7f207a7bf700 time 2021-08-14T14:55:32.203489+0000
Aug 14 20:25:32 node1 bash[29321]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/osd/OSD.cc: 9521: FAILED ceph_assert(started <= reserved_pushes)
Aug 14 20:25:32 node1 bash[29321]:  ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 14 20:25:32 node1 bash[29321]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x5564c7727dbe]
Aug 14 20:25:32 node1 bash[29321]:  2: (()+0x504fd8) [0x5564c7727fd8]
Aug 14 20:25:32 node1 bash[29321]:  3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x5564c780ac25]
Aug 14 20:25:32 node1 bash[29321]:  4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x5564c7a66a3d]
Aug 14 20:25:32 node1 bash[29321]:  5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x5564c78284df]
Aug 14 20:25:32 node1 bash[29321]:  6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5564c7e61224]
Aug 14 20:25:32 node1 bash[29321]:  7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5564c7e63e84]
Aug 14 20:25:32 node1 bash[29321]:  8: (()+0x82de) [0x7f209b04e2de]
Aug 14 20:25:32 node1 bash[29321]:  9: (clone()+0x43) [0x7f2099d85e83]
Aug 14 20:25:32 node1 bash[29321]: debug     -6> 2021-08-14T14:55:32.207+0000 7f207a7bf700 -1 *** Caught signal (Aborted) **
Aug 14 20:25:32 node1 bash[29321]:  in thread 7f207a7bf700 thread_name:tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:  ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
Aug 14 20:25:32 node1 bash[29321]:  1: (()+0x12dd0) [0x7f209b058dd0]
Aug 14 20:25:32 node1 bash[29321]:  2: (gsignal()+0x10f) [0x7f2099cc170f]
Aug 14 20:25:32 node1 bash[29321]:  3: (abort()+0x127) [0x7f2099cabb25]
Aug 14 20:25:32 node1 bash[29321]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5564c7727e0f]
Aug 14 20:25:32 node1 bash[29321]:  5: (()+0x504fd8) [0x5564c7727fd8]
Aug 14 20:25:32 node1 bash[29321]:  6: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x5564c780ac25]
Aug 14 20:25:32 node1 bash[29321]:  7: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x5564c7a66a3d]
Aug 14 20:25:32 node1 bash[29321]:  8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x5564c78284df]
Aug 14 20:25:32 node1 bash[29321]:  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5564c7e61224]
Aug 14 20:25:32 node1 bash[29321]:  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5564c7e63e84]
Aug 14 20:25:32 node1 bash[29321]:  11: (()+0x82de) [0x7f209b04e2de]
Aug 14 20:25:32 node1 bash[29321]:  12: (clone()+0x43) [0x7f2099d85e83]
Aug 14 20:25:32 node1 bash[29321]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 14 20:25:32 node1 bash[29321]: debug     -5> 2021-08-14T14:55:32.291+0000 7f2096867700 10 monclient: get_auth_request con 0x5564ec884800 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: debug     -4> 2021-08-14T14:55:32.291+0000 7f2097068700 10 monclient: get_auth_request con 0x5564ec884400 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: debug     -3> 2021-08-14T14:55:32.291+0000 7f2097869700 10 monclient: get_auth_request con 0x5564ec885800 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: debug     -2> 2021-08-14T14:55:32.291+0000 7f2096867700 10 monclient: get_auth_request con 0x5564ec885400 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: debug     -1> 2021-08-14T14:55:32.291+0000 7f2097869700 10 monclient: get_auth_request con 0x5564ec884c00 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: debug      0> 2021-08-14T14:55:32.291+0000 7f2097068700 10 monclient: get_auth_request con 0x5564ec885000 auth_method 0
Aug 14 20:25:32 node1 bash[29321]: --- logging levels ---
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 none
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 lockdep
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 context
Aug 14 20:25:32 node1 bash[29321]:    1/ 1 crush
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds_balancer
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds_locker
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds_log
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds_log_expire
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mds_migrator
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 buffer
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 timer
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 filer
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 striper
Aug 14 20:25:32 node1 bash[29321]:    0/ 1 objecter
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 rados
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 rbd
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 rbd_mirror
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 rbd_replay
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 rbd_rwl
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 journaler
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 objectcacher
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 immutable_obj_cache
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 client
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 osd
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 optracker
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 objclass
Aug 14 20:25:32 node1 bash[29321]:    1/ 3 filestore
Aug 14 20:25:32 node1 bash[29321]:    1/ 3 journal
Aug 14 20:25:32 node1 bash[29321]:    0/ 0 ms
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mon
Aug 14 20:25:32 node1 bash[29321]:    0/10 monc
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 paxos
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 tp
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 auth
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 crypto
Aug 14 20:25:32 node1 bash[29321]:    1/ 1 finisher
Aug 14 20:25:32 node1 bash[29321]:    1/ 1 reserver
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 heartbeatmap
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 perfcounter
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 rgw
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 rgw_sync
Aug 14 20:25:32 node1 bash[29321]:    1/10 civetweb
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 javaclient
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 asok
Aug 14 20:25:32 node1 bash[29321]:    1/ 1 throttle
Aug 14 20:25:32 node1 bash[29321]:    0/ 0 refs
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 compressor
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 bluestore
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 bluefs
Aug 14 20:25:32 node1 bash[29321]:    1/ 3 bdev
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 kstore
Aug 14 20:25:32 node1 bash[29321]:    4/ 5 rocksdb
Aug 14 20:25:32 node1 bash[29321]:    4/ 5 leveldb
Aug 14 20:25:32 node1 bash[29321]:    4/ 5 memdb
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 fuse
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mgr
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 mgrc
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 dpdk
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 eventtrace
Aug 14 20:25:32 node1 bash[29321]:    1/ 5 prioritycache
Aug 14 20:25:32 node1 bash[29321]:    0/ 5 test
Aug 14 20:25:32 node1 bash[29321]:   -2/-2 (syslog threshold)
Aug 14 20:25:32 node1 bash[29321]:   99/99 (stderr threshold)
Aug 14 20:25:32 node1 bash[29321]: --- pthread ID / name mapping for recent threads ---
Aug 14 20:25:32 node1 bash[29321]:   7f20787bb700 / osd_srv_heartbt
Aug 14 20:25:32 node1 bash[29321]:   7f2078fbc700 / tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:   7f20797bd700 / tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:   7f2079fbe700 / tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:   7f207a7bf700 / tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:   7f207afc0700 / tp_osd_tp
Aug 14 20:25:32 node1 bash[29321]:   7f2083fd2700 / ms_dispatch
Aug 14 20:25:32 node1 bash[29321]:   7f2084fd4700 / rocksdb:dump_st
Aug 14 20:25:32 node1 bash[29321]:   7f2085dd0700 / fn_anonymous
Aug 14 20:25:32 node1 bash[29321]:   7f2086dd2700 / cfin
Aug 14 20:25:32 node1 bash[29321]:   7f20891af700 / ms_dispatch
Aug 14 20:25:32 node1 bash[29321]:   7f208bddc700 / bstore_mempool
Aug 14 20:25:32 node1 bash[29321]:   7f2090fec700 / fn_anonymous
Aug 14 20:25:32 node1 bash[29321]:   7f20927ef700 / safe_timer
Aug 14 20:25:32 node1 bash[29321]:   7f2094863700 / signal_handler
Aug 14 20:25:32 node1 bash[29321]:   7f2095865700 / admin_socket
Aug 14 20:25:32 node1 bash[29321]:   7f2096066700 / service
Aug 14 20:25:32 node1 bash[29321]:   7f2096867700 / msgr-worker-2
Aug 14 20:25:32 node1 bash[29321]:   7f2097068700 / msgr-worker-1
Aug 14 20:25:32 node1 bash[29321]:   7f2097869700 / msgr-worker-0
Aug 14 20:25:32 node1 bash[29321]:   7f209d2e5f40 / ceph-osd
Aug 14 20:25:32 node1 bash[29321]:   max_recent     10000
Aug 14 20:25:32 node1 bash[29321]:   max_new         1000
Aug 14 20:25:32 node1 bash[29321]:   log_file /var/lib/ceph/crash/2021-08-14T14:55:32.212451Z_ecf0acef-c485-46f2-8ddd-1e7963500fad/log
Aug 14 20:25:32 node1 bash[29321]: --- end dump of recent events ---
Aug 14 20:25:32 node1 bash[29321]: reraise_fatal: default handler for signal 6 didn't terminate the process?
Aug 14 20:25:32 node1 podman[29544]: 2021-08-14 20:25:32.72196885 +0530 IST m=+11.967692929 container died bfe523d6c640aef9aac768c4b65d73efcac7261087c2f139e95a63efccf96507 (image=docker.io/ceph/ceph:v15, name=ceph-b6437922-3edf-11eb-adc2-0cc47a5ec98a-osd.7)
Aug 14 20:25:32 node1 podman[29544]: 2021-08-14 20:25:32.821355264 +0530 IST m=+12.067079343 container remove bfe523d6c640aef9aac768c4b65d73efcac7261087c2f139e95a63efccf96507 (image=docker.io/ceph/ceph:v15, name=ceph-b6437922-3edf-11eb-adc2-0cc47a5ec98a-osd.7, org.label-schema.build-date=20200809, RELEASE=HEAD, ceph=True, GIT_CLEAN=True, org.label-schema.vendor=CentOS, GIT_COMMIT=74bc74245d4867f6e24980130b7b697011bf73e6, GIT_BRANCH=HEAD, org.label-schema.license=GPLv2, org.label-schema.name=CentOS Base Image, CEPH_POINT_RELEASE=-15.2.7, org.label-schema.schema-version=1.0, maintainer=Dimitri Savineau <dsavinea@xxxxxxxxxx>, GIT_REPO=https://github.com/ceph/ceph-container.git)
Aug 14 20:25:32 node1 systemd[1]: ceph-b6437922-3edf-11eb-adc2-0cc47a5ec98a@osd.7.service: Main process exited, code=exited, status=1/FAILURE
Aug 14 20:25:32 node1 podman[29779]: 2021-08-14 20:25:32.957376516 +0530 IST m=+0.113447239 container create c24bd3146bb8aaa2801bc47ecfd1c11f61b4000c9f6c615a5ded16f3bf8b5ab5 (image=docker.io/ceph/ceph:v15, name=ceph-b6437922-3edf-11eb-adc2-0cc47a5ec98a-osd.7-deactivate, GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_COMMIT=74bc74245d4867f6e24980130b7b697011bf73e6, org.label-schema.schema-version=1.0, org.label-schema.license=GPLv2, ceph=True, RELEASE=HEAD, maintainer=Dimitri Savineau <dsavinea@xxxxxxxxxx>, GIT_CLEAN=True, GIT_BRANCH=HEAD, CEPH_POINT_RELEASE=-15.2.7, org.label-schema.name=CentOS Base Image, org.label-schema.vendor=CentOS, org.label-schema.build-date=20200809)
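
The assert in the attached log, FAILED ceph_assert(started <= reserved_pushes)
in OSD::do_recovery(), means a recovery pass started more pushes than the OSD
had reserved for it. One way to keep the OSDs up long enough to investigate
might be to pause recovery and backfill cluster-wide and check which
recovery settings the crashing OSD is actually running with (a sketch, not a
fix for the assert itself; run the daemon command on the node hosting osd.7,
inside the container for a podman deployment):

    # pause recovery and backfill so the OSDs stop entering the crashing path
    ceph osd set norecover
    ceph osd set nobackfill

    # inspect recovery-related settings on the crashing OSD
    ceph config show osd.7 | grep recovery
    ceph daemon osd.7 config get osd_recovery_max_single_start

    # re-enable recovery once the OSDs stay up
    ceph osd unset norecover
    ceph osd unset nobackfill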
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
