On 8/4/22 09:17, Torkil Svensgaard wrote:
Hi
We have a lot of OSDs flapping during recovery and eventually they don't
come up again until kicked with "ceph orch daemon restart osd.x".
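For reference, this is roughly how we kick them ("ceph osd tree down" to find
them, osd.102 just as an example):
"
# list the OSDs currently marked down and the hosts they live on
ceph osd tree down

# restart the stuck daemon via the orchestrator
ceph orch daemon restart osd.102

# check whether it actually crashed or just shut itself down
ceph crash ls-new
"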
This is the end of the log for one OSD going down for good:
"
2022-08-04T09:57:31.752+0000 7f3812cb2700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:35.520+0000 7f37f5a67700 1 heartbeat_map clear_timeout
'OSD::osd_op_tp thread 0x7f37f5a67700' had timed out after 15.000000954s
2022-08-04T09:57:35.520+0000 7f37f5a67700 0
bluestore(/var/lib/ceph/osd/ceph-102) log_latency slow operation
observed for submit_transact, latency = 63.815132141s
2022-08-04T09:57:35.531+0000 7f3806c9a700 0
bluestore(/var/lib/ceph/osd/ceph-102) log_latency_fn slow operation
observed for _txc_committed_kv, latency = 63.825824738s, txc =
0x55f84d31d500
2022-08-04T09:57:35.554+0000 7f3802c92700 0 log_channel(cluster) log
[WRN] : Monitor daemon marked osd.102 down, but it is still running
2022-08-04T09:57:35.554+0000 7f3802c92700 0 log_channel(cluster) log
[DBG] : map e138671 wrongly marked me down at e138668
2022-08-04T09:57:35.554+0000 7f3802c92700 -1 osd.102 138671
_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
600.000000 seconds, shutting down
2022-08-04T09:57:35.558+0000 7f3802c92700 0 osd.102 138671
_committed_osd_maps shutdown OSD via async signal
2022-08-04T09:57:35.558+0000 7f38114af700 -1 received signal: Interrupt
from Kernel ( Could be generated by pthread_kill(), raise(), abort(),
alarm() ) UID: 0
2022-08-04T09:57:35.558+0000 7f38114af700 -1 osd.102 138671 *** Got
signal Interrupt ***
2022-08-04T09:57:35.558+0000 7f38114af700 -1 osd.102 138671 ***
Immediate shutdown (osd_fast_shutdown=true) ***
"
Best regards,
Torkil
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
6 hosts connected by 2 x 10 Gbit. Most data is in an EC 2+2 rbd pool.
"
# ceph -s
  cluster:
    id:     3b7736c6-00e4-11ec-a3c5-3cecef467984
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Degraded data redundancy: 95909/1023542135 objects degraded (0.009%), 10 pgs degraded, 6 pgs undersized
            484 pgs not deep-scrubbed in time
            725 pgs not scrubbed in time
            11 daemons have recently crashed
            4 slow ops, oldest one blocked for 178 sec, daemons [osd.13,osd.15,osd.19,osd.46,osd.50] have slow ops.

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 51s)
    mgr:        dcn-ceph-01.dzercj(active, since 20h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 3m), 118 in (since 86m); 137 remapped pgs
                flags noscrub,nodeep-scrub,nosnaptrim
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 257.30M objects, 334 TiB
    usage:   673 TiB used, 680 TiB / 1.3 PiB avail
    pgs:     95909/1023542135 objects degraded (0.009%)
             7498287/1023542135 objects misplaced (0.733%)
             2505 active+clean
             98   active+remapped+backfilling
             85   active+clean+snaptrim_wait
             32   active+remapped+backfill_wait
             6    active+clean+laggy
             5    active+undersized+degraded+remapped+backfilling
             4    active+recovering+degraded
             1    active+recovering+degraded+remapped
             1    active+undersized+remapped+backfilling

  io:
    client:   45 KiB/s rd, 1.3 MiB/s wr, 52 op/s rd, 91 op/s wr
    recovery: 1.2 GiB/s, 467 objects/s

  progress:
    Global Recovery Event (20h)
      [==========================..] (remaining: 65m)
"
Crash info for one OSD:
"
2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
in thread 7fdedd02f700 thread_name:tp_osd_tp
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
2: pread64()
3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x40d) [0x55701b4c0f0d]
4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
long, char*)+0x60d) [0x55701b05ee7d]
5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long,
unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*,
rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned
long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
9: (rocksdb::Status
rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919)
[0x55701b691ba9]
10: (rocksdb::Status
rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286)
[0x55701b691f86]
11:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable
const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1)
[0x55701b760891]
12:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe)
[0x55701b760b5e]
13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&,
bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*)
const+0x43) [0x55701b69aae3]
14:
(rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions
const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool,
rocksdb::SliceTransform const*, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, bool)+0x180) [0x55701b6829a0]
16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
17: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*,
rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537)
[0x55701b50fbe7]
19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
20: (RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b)
[0x55701b4d617b]
21:
(BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7)
[0x55701af50b57]
22: (BlueStore::_wctx_finish(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0xeaf) [0x55701af908cf]
23: (BlueStore::_do_truncate(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, unsigned long,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0x3a8) [0x55701af91f68]
24: (BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
25: (BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
27:
(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction>
>&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316)
[0x55701afba1a6]
28:
(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap
const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
30: (OSD::dequeue_peering_evt(OSDShard*, PG*,
std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8)
[0x55701aa9ee98]
31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
32: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55701b1025b4]
34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
36: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
"
Some errors seen in the journal on the host:
"
Aug 04 06:57:59 dcn-ceph-01 bash[5683]: debug
2022-08-04T06:57:59.735+0000 7fa687612700 -1
librbd::SnapshotRemoveRequest: 0x559541832780 handle_trash_snap: failed
to move snapshot to trash: (16) Device or resource busy
Aug 04 06:57:59 dcn-ceph-01 bash[7700]: debug
2022-08-04T06:57:59.827+0000 7f6f7695e700 -1 osd.25 138094
heartbeat_check: no reply from 172.21.15.55:6844 osd.100 since back
2022-08-04T06:57:22.475054+0000 front 2022-08-04T06:57:22.475002+0000
(oldest deadline 2022-08-04T06:57:46.573685+0000)
Aug 04 06:59:34 dcn-ceph-01 bash[8385]: debug
2022-08-04T06:59:34.398+0000 7f4b440c4700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f4b25676700' had timed out after 15.000000954s
Aug 04 06:59:31 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:31.619+0000 7fab9ede5700 1 heartbeat_map reset_timeout
'Monitor::cpu_tp thread 0x7fab9ede5700' had timed out after 0.000000000s
Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:29.808+0000 7fab9cde1700 1 mon.dcn-ceph-01@4(electing)
e21 collect_metadata md0: no unique device id for md0: fallback method
has no model nor serial'
"
There are a lot of these 15s heartbeat timeouts.
I'm not sure what to look for. The cluster is quite busy, but OSDs
shouldn't flap or crash like that, should they? Perhaps some
heartbeat/timeout values could be tweaked?
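The knobs I've found so far (no idea whether raising them is sane, and I
realise it would only mask the slow ops rather than fix them):
"
# the 15 s in the heartbeat_map messages
ceph config get osd osd_op_thread_timeout   # default 15

# the peer heartbeat deadline behind the heartbeat_check messages
ceph config get osd osd_heartbeat_grace     # default 20

# possible experiment while recovery runs
ceph config set osd osd_heartbeat_grace 30
"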
Best regards,
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx