Hi
We have a lot of OSDs flapping during recovery, and eventually they don't
come up again until kicked with "ceph orch daemon restart osd.x".
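For reference, this is roughly what I run to kick them all when several
are stuck down at once. It assumes jq is installed, and the jq filter on
the osd tree JSON is my own guess at the field names:
"
#!/bin/bash
# Restart every OSD the cluster currently sees as down, via cephadm.
# "ceph osd tree down" limits the tree to subtrees with down OSDs;
# the jq filter then picks out the osd entries themselves.
for id in $(ceph osd tree down -f json | jq -r '.nodes[] | select(.type=="osd") | .id'); do
    echo "restarting osd.${id}"
    ceph orch daemon restart "osd.${id}"
done
"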
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
6 hosts connected by 2 x 10 Gbit links. Most data is in an EC 2+2 rbd pool.
"
# ceph -s
  cluster:
    id:     3b7736c6-00e4-11ec-a3c5-3cecef467984
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Degraded data redundancy: 95909/1023542135 objects degraded (0.009%), 10 pgs degraded, 6 pgs undersized
            484 pgs not deep-scrubbed in time
            725 pgs not scrubbed in time
            11 daemons have recently crashed
            4 slow ops, oldest one blocked for 178 sec, daemons [osd.13,osd.15,osd.19,osd.46,osd.50] have slow ops.

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 51s)
    mgr:        dcn-ceph-01.dzercj(active, since 20h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 3m), 118 in (since 86m); 137 remapped pgs
                flags noscrub,nodeep-scrub,nosnaptrim
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 257.30M objects, 334 TiB
    usage:   673 TiB used, 680 TiB / 1.3 PiB avail
    pgs:     95909/1023542135 objects degraded (0.009%)
             7498287/1023542135 objects misplaced (0.733%)
             2505 active+clean
             98   active+remapped+backfilling
             85   active+clean+snaptrim_wait
             32   active+remapped+backfill_wait
             6    active+clean+laggy
             5    active+undersized+degraded+remapped+backfilling
             4    active+recovering+degraded
             1    active+recovering+degraded+remapped
             1    active+undersized+remapped+backfilling

  io:
    client:   45 KiB/s rd, 1.3 MiB/s wr, 52 op/s rd, 91 op/s wr
    recovery: 1.2 GiB/s, 467 objects/s

  progress:
    Global Recovery Event (20h)
      [==========================..] (remaining: 65m)
"
Crash info for one OSD:
"
2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
in thread 7fdedd02f700 thread_name:tp_osd_tp
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
2: pread64()
3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x40d) [0x55701b4c0f0d]
4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
long, char*)+0x60d) [0x55701b05ee7d]
5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long,
unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*,
rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned
long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
9: (rocksdb::Status
rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919)
[0x55701b691ba9]
10: (rocksdb::Status
rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286) [0x55701b691f86]
11:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable
const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1)
[0x55701b760891]
12:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe)
[0x55701b760b5e]
13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&,
bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*)
const+0x43) [0x55701b69aae3]
14:
(rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions
const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool,
rocksdb::SliceTransform const*, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, bool)+0x180) [0x55701b6829a0]
16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
17: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*,
rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537)
[0x55701b50fbe7]
19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
20: (RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b)
[0x55701b4d617b]
21:
(BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7)
[0x55701af50b57]
22: (BlueStore::_wctx_finish(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0xeaf) [0x55701af908cf]
23: (BlueStore::_do_truncate(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, unsigned long,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0x3a8) [0x55701af91f68]
24: (BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
25: (BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
27:
(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction>
>&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316)
[0x55701afba1a6]
28:
(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap
const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
30: (OSD::dequeue_peering_evt(OSDShard*, PG*,
std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8)
[0x55701aa9ee98]
31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
32: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55701b1025b4]
34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
36: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
"
Some errors seen in the journal on the host:
"
Aug 04 06:57:59 dcn-ceph-01 bash[5683]: debug
2022-08-04T06:57:59.735+0000 7fa687612700 -1
librbd::SnapshotRemoveRequest: 0x559541832780 handle_trash_snap: failed
to move snapshot to trash: (16) Device or resource busy
Aug 04 06:57:59 dcn-ceph-01 bash[7700]: debug
2022-08-04T06:57:59.827+0000 7f6f7695e700 -1 osd.25 138094
heartbeat_check: no reply from 172.21.15.55:6844 osd.100 since back
2022-08-04T06:57:22.475054+0000 front 2022-08-04T06:57:22.475002+0000
(oldest deadline 2022-08-04T06:57:46.573685+0000)
Aug 04 06:59:34 dcn-ceph-01 bash[8385]: debug
2022-08-04T06:59:34.398+0000 7f4b440c4700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f4b25676700' had timed out after 15.000000954s
Aug 04 06:59:31 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:31.619+0000 7fab9ede5700 1 heartbeat_map reset_timeout
'Monitor::cpu_tp thread 0x7fab9ede5700' had timed out after 0.000000000s
Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:29.808+0000 7fab9cde1700 1 mon.dcn-ceph-01@4(electing)
e21 collect_metadata md0: no unique device id for md0: fallback method
has no model nor serial'
"
There are a lot of those 15s heartbeat timeouts.
I'm not sure what to look for. The cluster is quite busy, but OSDs
shouldn't flap or crash like that, should they? Perhaps some
heartbeat/timeout values could be tweaked, something like the sketch below?
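This is what I had in mind, if it makes sense at all. The values are just
guesses on my part, not recommendations; as far as I know
osd_op_thread_timeout defaults to 15 s (which matches the messages above)
and osd_heartbeat_grace to 20 s:
"
# give the op threads and heartbeat peers a bit more slack
ceph config set osd osd_op_thread_timeout 30
ceph config set osd osd_heartbeat_grace 30
# or slow recovery down instead, so the OSDs stop starving heartbeats
ceph config set osd osd_recovery_sleep_hdd 0.2
"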
Best regards,
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx