Hi, that error really only showed up when I tried to run ceph-bluestore-tool repair. The 3 OSDs that keep crashing show the following log; please let me know if there is something I can do to get the pool back to a functioning state.

Uptime(secs): 0.0 total, 0.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 21.51 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

-32> 2021-03-11 14:25:55.812 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 445022208 unmapped: 59588608 heap: 504610816 old mem: 134217728 new mem: 2564495564
-31> 2021-03-11 14:25:55.813 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 445210624 unmapped: 59400192 heap: 504610816 old mem: 2564495564 new mem: 2816296009
-30> 2021-03-11 14:25:55.813 7f50161b7700 5 bluestore.MempoolThread(0x558aa9a8ea98) _trim_shards cache_size: 2816296009 kv_alloc: 956301312 kv_used: 6321600 meta_alloc: 956301312 meta_used: 11680 data_alloc: 620756992 data_used: 151552
-29> 2021-03-11 14:25:55.838 7f502962ba80 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
-28> 2021-03-11 14:25:55.839 7f502962ba80 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/cls/hello/cls_hello.cc:296: loading cls_hello
-27> 2021-03-11 14:25:55.840 7f502962ba80 0 _get_class not permitted to load kvs
-26> 2021-03-11 14:25:55.840 7f502962ba80 0 _get_class not permitted to load lua
-25> 2021-03-11 14:25:55.852 7f502962ba80 0 _get_class not permitted to load sdk
-24> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 283675107524608, adjusting msgr requires for clients
-23> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 283675107524608 was 8705, adjusting msgr requires for mons
-22> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 3026702624700514304, adjusting msgr requires for osds
-21> 2021-03-11 14:25:56.814 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 501989376 unmapped: 21192704 heap: 523182080 old mem: 2816296009 new mem: 2842012351
-20> 2021-03-11 14:25:57.816 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 542867456 unmapped: 19210240 heap: 562077696 old mem: 2842012351 new mem: 2844985645
-19> 2021-03-11 14:25:58.327 7f502962ba80 0 osd.80 697960 load_pgs
-18> 2021-03-11 14:25:58.818 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 596754432 unmapped: 229376 heap: 596983808 old mem: 2844985645 new mem: 2845356061
-17> 2021-03-11 14:25:59.818 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 679067648 unmapped: 753664 heap: 679821312 old mem: 2845356061 new mem: 2845406382
-16> 2021-03-11 14:26:00.820 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 729161728 unmapped: 991232 heap: 730152960 old mem: 2845406382 new mem: 2845414228
-15> 2021-03-11 14:26:00.820 7f50161b7700 5 bluestore.MempoolThread(0x558aa9a8ea98) _trim_shards cache_size: 2845414228 kv_alloc: 1174405120 kv_used: 166707616 meta_alloc: 1006632960 meta_used: 54703 data_alloc: 637534208 data_used: 745472
-14> 2021-03-11 14:26:01.735 7f502962ba80 0 osd.80 697960 load_pgs opened 63 pgs
-13> 2021-03-11 14:26:01.735 7f502962ba80 0 osd.80 697960 using weightedpriority op queue with priority op cut off at 64.
-12> 2021-03-11 14:26:01.736 7f502962ba80 -1 osd.80 697960 log_to_monitors {default=true}
-11> 2021-03-11 14:26:01.743 7f502962ba80 -1 osd.80 697960 mon_cmd_maybe_osd_create fail: 'osd.80 has already bound to class 'backup', can not reset class to 'hdd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
-10> 2021-03-11 14:26:01.746 7f502962ba80 0 osd.80 697960 done with init, starting boot process
-9> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc handle_mgr_map Got map version 175738
-8> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc handle_mgr_map Active mgr is now [v2:10.69.57.2:6802/3699587,v1:10.69.57.2:6803/3699587]
-7> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc reconnect Starting new session with [v2:10.69.57.2:6802/3699587,v1:10.69.57.2:6803/3699587]
-6> 2021-03-11 14:26:01.752 7f5013740700 4 mgrc handle_mgr_configure stats_period=5
-5> 2021-03-11 14:26:01.752 7f5013740700 4 mgrc handle_mgr_configure updated stats threshold: 5
-4> 2021-03-11 14:26:01.822 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 802258944 unmapped: 245760 heap: 802504704 old mem: 2845414228 new mem: 2845415533
-3> 2021-03-11 14:26:02.826 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 815693824 unmapped: 6733824 heap: 822427648 old mem: 2845415533 new mem: 2845415776
-2> 2021-03-11 14:26:03.084 7f5005b9c700 0 log_channel(cluster) log [INF] : 0.23 continuing backfill to osd.41 from (31979'7979,561511'9571] MIN to 561511'9571
-1> 2021-03-11 14:26:03.088 7f5003b98700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f5003b98700 time 2021-03-11 14:26:03.080329
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/osd/PGLog.cc: 368: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x558a9db62c7d]
2: (()+0x4d8e45) [0x558a9db62e45]
3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c22) [0x558a9dd76a42]
4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x558a9dcd8234]
5: (PG::RecoveryState::Stray::react(MLogRec const&)+0x22b) [0x558a9dd1adeb]
6: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x558a9dd693f5]
7: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x558a9dd2d3ed]
8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x558a9dc69f34]
9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x558a9ded2291]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x558a9dc5ea4f]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558a9e216e56]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a9e219970]
13: (()+0x7ea5) [0x7f5026673ea5]
14: (clone()+0x6d) [0x7f50255369fd]

0> 2021-03-11 14:26:03.099 7f5003b98700 -1 *** Caught signal (Aborted) ** in thread 7f5003b98700 thread_name:tp_osd_tp

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0xf630) [0x7f502667b630]
2: (gsignal()+0x37) [0x7f502546e3d7]
3: (abort()+0x148) [0x7f502546fac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x558a9db62ccc]
5: (()+0x4d8e45) [0x558a9db62e45]
6: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c22) [0x558a9dd76a42]
7: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x558a9dcd8234]
8: (PG::RecoveryState::Stray::react(MLogRec const&)+0x22b) [0x558a9dd1adeb]
9: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x558a9dd693f5]
10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x558a9dd2d3ed]
11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x558a9dc69f34]
12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x558a9ded2291]
13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x558a9dc5ea4f]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558a9e216e56]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a9e219970]
16: (()+0x7ea5) [0x7f5026673ea5]
17: (clone()+0x6d) [0x7f50255369fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
0/ 0 mds_balancer
0/ 0 mds_locker
0/ 0 mds_log
0/ 0 mds_log_expire
0/ 0 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.80.log
--- end dump of recent events ---
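Also, about the allocator workaround you mention below for the repair crash: just so I do not get it wrong, I assume you are referring to the bluestore_allocator option. Would something along these lines be the right way to apply it? (The option name, the choice of bitmap over avl, and my guess that ceph-bluestore-tool accepts config overrides on its command line are assumptions on my part, so please correct me if the syntax is off.)

  # untested sketch: retry the offline repair with a non-default allocator
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-81 --log-level 10 --bluestore_allocator bitmap

  # and, if the running OSDs should use it as well (assuming systemd-managed OSDs):
  ceph config set osd bluestore_allocator bitmap
  systemctl restart ceph-osd@80

I also understand from your reply that the PGLog::merge_log assert above is a separate problem from the Allocator::SocketHook one during repair, so any hint on how to bring these 3 OSDs and the backup pool back online would be very welcome.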
On Thu, Mar 11, 2021 at 1:23 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> Hi Cassiano,
>
> the backtrace you've provided relates to the bug fixed by:
> https://github.com/ceph/ceph/pull/37793
>
> This fix is going to be released with the upcoming v14.2.17.
>
> But I doubt that your original crashes have the same root cause - this
> issue appears during shutdown only.
>
> Anyway, you can work around it by using a different allocator: avl or bitmap.
>
> Thanks,
> Igor
>
> On 3/11/2021 6:21 PM, Cassiano Pilipavicius wrote:
> > Hi, please, if someone knows how to help: I have an HDD pool in my cluster,
> > and after rebooting one server, my OSDs have started to crash.
> >
> > This pool is a backup pool and has OSD as the failure domain with a size of 2.
> >
> > After rebooting one server, my OSDs started to crash, and the thing is only
> > getting worse. I have then tried to run ceph-bluestore-tool repair and I
> > receive what I think is the same error that shows in the OSD logs:
> >
> > [root@cwvh13 ~]# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-81 --log-level 10
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f6467ffcec0 time 2021-03-11 12:13:12.121766
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
> > ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f645e1a7b27]
> > 2: (()+0x25ccef) [0x7f645e1a7cef]
> > 3: (()+0x3cd57f) [0x5642e85c457f]
> > 4: (HybridAllocator::~HybridAllocator()+0x17) [0x5642e85f3f37]
> > 5: (BlueStore::_close_alloc()+0x42) [0x5642e84379d2]
> > 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x5642e84bbac8]
> > 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x293) [0x5642e84bbf13]
> > 8: (main()+0x13cc) [0x5642e83caaec]
> > 9: (__libc_start_main()+0xf5) [0x7f645ae24555]
> > 10: (()+0x1fae9f) [0x5642e83f1e9f]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx