Hi, that error really only showed up when I tried to run ceph-bluestore-tool repair. The 3 OSDs that keep crashing show the following log; please let me know if there is something I can do to get the pool back to a functioning state.

Uptime(secs): 0.0 total, 0.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 21.51 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

-32> 2021-03-11 14:25:55.812 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 445022208 unmapped: 59588608 heap: 504610816 old mem: 134217728 new mem: 2564495564
-31> 2021-03-11 14:25:55.813 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 445210624 unmapped: 59400192 heap: 504610816 old mem: 2564495564 new mem: 2816296009
-30> 2021-03-11 14:25:55.813 7f50161b7700 5 bluestore.MempoolThread(0x558aa9a8ea98) _trim_shards cache_size: 2816296009 kv_alloc: 956301312 kv_used: 6321600 meta_alloc: 956301312 meta_used: 11680 data_alloc: 620756992 data_used: 151552
-29> 2021-03-11 14:25:55.838 7f502962ba80 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/cls/cephfs/cls_cephfs.cc:197: loading cephfs
-28> 2021-03-11 14:25:55.839 7f502962ba80 0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/cls/hello/cls_hello.cc:296: loading cls_hello
-27> 2021-03-11 14:25:55.840 7f502962ba80 0 _get_class not permitted to load kvs
-26> 2021-03-11 14:25:55.840 7f502962ba80 0 _get_class not permitted to load lua
-25> 2021-03-11 14:25:55.852 7f502962ba80 0 _get_class not permitted to load sdk
-24> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 283675107524608, adjusting msgr requires for clients
-23> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 283675107524608 was 8705, adjusting msgr requires for mons
-22> 2021-03-11 14:25:55.853 7f502962ba80 0 osd.80 697960 crush map has features 3026702624700514304, adjusting msgr requires for osds
-21> 2021-03-11 14:25:56.814 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 501989376 unmapped: 21192704 heap: 523182080 old mem: 2816296009 new mem: 2842012351
-20> 2021-03-11 14:25:57.816 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 542867456 unmapped: 19210240 heap: 562077696 old mem: 2842012351 new mem: 2844985645
-19> 2021-03-11 14:25:58.327 7f502962ba80 0 osd.80 697960 load_pgs
-18> 2021-03-11 14:25:58.818 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 596754432 unmapped: 229376 heap: 596983808 old mem: 2844985645 new mem: 2845356061
-17> 2021-03-11 14:25:59.818 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 679067648 unmapped: 753664 heap: 679821312 old mem: 2845356061 new mem: 2845406382
-16> 2021-03-11 14:26:00.820 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 729161728 unmapped: 991232 heap: 730152960 old mem: 2845406382 new mem: 2845414228
-15> 2021-03-11 14:26:00.820 7f50161b7700 5 bluestore.MempoolThread(0x558aa9a8ea98) _trim_shards cache_size: 2845414228 kv_alloc: 1174405120 kv_used: 166707616 meta_alloc: 1006632960 meta_used: 54703 data_alloc: 637534208 data_used: 745472
-14> 2021-03-11 14:26:01.735 7f502962ba80 0 osd.80 697960 load_pgs opened 63 pgs
-13> 2021-03-11 14:26:01.735 7f502962ba80 0 osd.80 697960 using weightedpriority op queue with priority op cut off at 64.
-12> 2021-03-11 14:26:01.736 7f502962ba80 -1 osd.80 697960 log_to_monitors {default=true}
-11> 2021-03-11 14:26:01.743 7f502962ba80 -1 osd.80 697960 mon_cmd_maybe_osd_create fail: 'osd.80 has already bound to class 'backup', can not reset class to 'hdd'; use 'ceph osd crush rm-device-class <id>' to remove old class first': (16) Device or resource busy
-10> 2021-03-11 14:26:01.746 7f502962ba80 0 osd.80 697960 done with init, starting boot process
-9> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc handle_mgr_map Got map version 175738
-8> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc handle_mgr_map Active mgr is now [v2:10.69.57.2:6802/3699587,v1:10.69.57.2:6803/3699587]
-7> 2021-03-11 14:26:01.748 7f5013740700 4 mgrc reconnect Starting new session with [v2:10.69.57.2:6802/3699587,v1:10.69.57.2:6803/3699587]
-6> 2021-03-11 14:26:01.752 7f5013740700 4 mgrc handle_mgr_configure stats_period=5
-5> 2021-03-11 14:26:01.752 7f5013740700 4 mgrc handle_mgr_configure updated stats threshold: 5
-4> 2021-03-11 14:26:01.822 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 802258944 unmapped: 245760 heap: 802504704 old mem: 2845414228 new mem: 2845415533
-3> 2021-03-11 14:26:02.826 7f50161b7700 5 prioritycache tune_memory target: 4294967296 mapped: 815693824 unmapped: 6733824 heap: 822427648 old mem: 2845415533 new mem: 2845415776
-2> 2021-03-11 14:26:03.084 7f5005b9c700 0 log_channel(cluster) log [INF] : 0.23 continuing backfill to osd.41 from (31979'7979,561511'9571] MIN to 561511'9571
-1> 2021-03-11 14:26:03.088 7f5003b98700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f5003b98700 time 2021-03-11 14:26:03.080329
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/osd/PGLog.cc: 368: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x558a9db62c7d]
2: (()+0x4d8e45) [0x558a9db62e45]
3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c22) [0x558a9dd76a42]
4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x558a9dcd8234]
5: (PG::RecoveryState::Stray::react(MLogRec const&)+0x22b) [0x558a9dd1adeb]
6: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x558a9dd693f5]
7: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x558a9dd2d3ed]
8: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x558a9dc69f34]
9: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x558a9ded2291]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x558a9dc5ea4f]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558a9e216e56]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a9e219970]
13: (()+0x7ea5) [0x7f5026673ea5]
14: (clone()+0x6d) [0x7f50255369fd]

0> 2021-03-11 14:26:03.099 7f5003b98700 -1 *** Caught signal (Aborted) ** in thread 7f5003b98700 thread_name:tp_osd_tp

ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
1: (()+0xf630) [0x7f502667b630]
2: (gsignal()+0x37) [0x7f502546e3d7]
3: (abort()+0x148) [0x7f502546fac8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x558a9db62ccc]
5: (()+0x4d8e45) [0x558a9db62e45]
6: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1c22) [0x558a9dd76a42]
7: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x558a9dcd8234]
8: (PG::RecoveryState::Stray::react(MLogRec const&)+0x22b) [0x558a9dd1adeb]
9: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x558a9dd693f5]
10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2dd) [0x558a9dd2d3ed]
11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x558a9dc69f34]
12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x558a9ded2291]
13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x558a9dc5ea4f]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x558a9e216e56]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x558a9e219970]
16: (()+0x7ea5) [0x7f5026673ea5]
17: (clone()+0x6d) [0x7f50255369fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
0/ 0 mds_balancer
0/ 0 mds_locker
0/ 0 mds_log
0/ 0 mds_log_expire
0/ 0 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.80.log
--- end dump of recent events ---
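Also, about the allocator workaround you mention below for the repair crash: just so I do not get it wrong, I assume you are referring to the bluestore_allocator option. Would something along these lines be the right way to apply it? (The option name, the choice of bitmap over avl, and my guess that ceph-bluestore-tool accepts config overrides on its command line are assumptions on my part, so please correct me if the syntax is off.)

  # untested sketch: retry the offline repair with a non-default allocator
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-81 --log-level 10 --bluestore_allocator bitmap

  # and, if the running OSDs should use it as well (assuming systemd-managed OSDs):
  ceph config set osd bluestore_allocator bitmap
  systemctl restart ceph-osd@80

I also understand from your reply that the PGLog::merge_log assert above is a separate problem from the Allocator::SocketHook one during repair, so any hint on how to bring these 3 OSDs and the backup pool back online would be very welcome.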
On Thu, Mar 11, 2021 at 1:23 PM Igor Fedotov <ifedotov@xxxxxxx> wrote:
> Hi Cassiano,
>
> the backtrace you've provided relates to the bug fixed by:
> https://github.com/ceph/ceph/pull/37793
>
> This fix is going to be released with the upcoming v14.2.17.
>
> But I doubt that your original crashes have the same root cause - this
> issue appears during shutdown only.
>
> Anyway, you can work around it by using a different allocator: avl or bitmap.
>
> Thanks,
> Igor
>
> On 3/11/2021 6:21 PM, Cassiano Pilipavicius wrote:
> > Hi, please, if someone knows how to help: I have an HDD pool in my cluster,
> > and after rebooting one server, my OSDs have started to crash.
> >
> > This pool is a backup pool and has OSD as the failure domain with a size of 2.
> >
> > After rebooting one server, my OSDs started to crash, and the thing is only
> > getting worse. I have then tried to run ceph-bluestore-tool repair and I
> > receive what I think is the same error that shows in the OSD logs:
> >
> > [root@cwvh13 ~]# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-81 --log-level 10
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/Allocator.cc: In function 'virtual Allocator::SocketHook::~SocketHook()' thread 7f6467ffcec0 time 2021-03-11 12:13:12.121766
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.16/rpm/el7/BUILD/ceph-14.2.16/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
> > ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f645e1a7b27]
> > 2: (()+0x25ccef) [0x7f645e1a7cef]
> > 3: (()+0x3cd57f) [0x5642e85c457f]
> > 4: (HybridAllocator::~HybridAllocator()+0x17) [0x5642e85f3f37]
> > 5: (BlueStore::_close_alloc()+0x42) [0x5642e84379d2]
> > 6: (BlueStore::_close_db_and_around(bool)+0x2f8) [0x5642e84bbac8]
> > 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x293) [0x5642e84bbf13]
> > 8: (main()+0x13cc) [0x5642e83caaec]
> > 9: (__libc_start_main()+0xf5) [0x7f645ae24555]
> > 10: (()+0x1fae9f) [0x5642e83f1e9f]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx