Re: OSDs crashing/flapping

Hi Torkil,

it looks like you're facing a pretty well-known problem with RocksDB performance degradation caused by bulk data removal. This has been discussed multiple times on this mailing list.

And here is one of the relevant tracker issues: https://tracker.ceph.com/issues/40741

To eliminate the effect you might want to do a manual DB compaction using ceph-kvstore-tool for all your OSDs. This will cure the current degraded OSD state but unfortunately doesn't protect against similar cases in the future.
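As a rough sketch, for a cephadm-managed cluster like yours the per-OSD compaction could look like the following. The OSD id (12), the data path, and the DRY_RUN wrapper are placeholders/assumptions of mine, not verified against your deployment; the OSD must be stopped before compacting its DB offline:

```shell
# Sketch: offline RocksDB compaction for a single OSD on a cephadm host.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to
# actually execute them, one OSD at a time, after checking your paths.
OSD_ID="${OSD_ID:-12}"   # hypothetical OSD id; substitute your own

# Helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

run ceph orch daemon stop "osd.$OSD_ID"
run cephadm shell --name "osd.$OSD_ID" -- \
    ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-$OSD_ID" compact
run ceph orch daemon start "osd.$OSD_ID"
```

If stopping each OSD is too disruptive, Pacific can also trigger an online compaction with "ceph tell osd.<id> compact", at the cost of doing the work while the daemon serves I/O.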


Thanks,

Igor

On 8/4/2022 10:17 AM, Torkil Svensgaard wrote:
Hi

We have a lot of OSDs flapping during recovery and eventually they don't come up again until kicked with "ceph orch daemon restart osd.x".

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

6 hosts connected by 2 x 10 Gbit. Most data is in an EC 2+2 rbd pool.

"
# ceph -s
  cluster:
    id:     3b7736c6-00e4-11ec-a3c5-3cecef467984
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Degraded data redundancy: 95909/1023542135 objects degraded (0.009%), 10 pgs degraded, 6 pgs undersized
            484 pgs not deep-scrubbed in time
            725 pgs not scrubbed in time
            11 daemons have recently crashed
            4 slow ops, oldest one blocked for 178 sec, daemons [osd.13,osd.15,osd.19,osd.46,osd.50] have slow ops.

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 51s)
    mgr:        dcn-ceph-01.dzercj(active, since 20h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 3m), 118 in (since 86m); 137 remapped pgs
                flags noscrub,nodeep-scrub,nosnaptrim
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 257.30M objects, 334 TiB
    usage:   673 TiB used, 680 TiB / 1.3 PiB avail
    pgs:     95909/1023542135 objects degraded (0.009%)
             7498287/1023542135 objects misplaced (0.733%)
             2505 active+clean
             98   active+remapped+backfilling
             85   active+clean+snaptrim_wait
             32   active+remapped+backfill_wait
             6    active+clean+laggy
             5    active+undersized+degraded+remapped+backfilling
             4    active+recovering+degraded
             1    active+recovering+degraded+remapped
             1    active+undersized+remapped+backfilling

  io:
    client:   45 KiB/s rd, 1.3 MiB/s wr, 52 op/s rd, 91 op/s wr
    recovery: 1.2 GiB/s, 467 objects/s

  progress:
    Global Recovery Event (20h)
      [==========================..] (remaining: 65m)
"

Crash info for one OSD:

"
2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
 in thread 7fdedd02f700 thread_name:tp_osd_tp

 ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
 2: pread64()
 3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x40d) [0x55701b4c0f0d]
 4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x60d) [0x55701b05ee7d]
 5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
 7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
 8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
 9: (rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919) [0x55701b691ba9]
 10: (rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286) [0x55701b691f86]
 11: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1) [0x55701b760891]
 12: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe) [0x55701b760b5e]
 13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x43) [0x55701b69aae3]
 14: (rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransform const*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
 15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bool)+0x180) [0x55701b6829a0]
 16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
 17: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
 18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537) [0x55701b50fbe7]
 19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
 20: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b) [0x55701b4d617b]
 21: (BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7) [0x55701af50b57]
 22: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0xeaf) [0x55701af908cf]
 23: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x3a8) [0x55701af91f68]
 24: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
 25: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
 26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
 27: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x55701afba1a6]
 28: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
 29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
 30: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55701aa9ee98]
 31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
 32: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
 33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55701b1025b4]
 34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
 35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
 36: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

"

Some errors seen in the journal on the host:

"
Aug 04 06:57:59 dcn-ceph-01 bash[5683]: debug 2022-08-04T06:57:59.735+0000 7fa687612700 -1 librbd::SnapshotRemoveRequest: 0x559541832780 handle_trash_snap: failed to move snapshot to trash: (16) Device or resource busy

Aug 04 06:57:59 dcn-ceph-01 bash[7700]: debug 2022-08-04T06:57:59.827+0000 7f6f7695e700 -1 osd.25 138094 heartbeat_check: no reply from 172.21.15.55:6844 osd.100 since back 2022-08-04T06:57:22.475054+0000 front 2022-08-04T06:57:22.475002+0000 (oldest deadline 2022-08-04T06:57:46.573685+0000)

Aug 04 06:59:34 dcn-ceph-01 bash[8385]: debug 2022-08-04T06:59:34.398+0000 7f4b440c4700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b25676700' had timed out after 15.000000954s

Aug 04 06:59:31 dcn-ceph-01 bash[5230]: debug 2022-08-04T06:59:31.619+0000 7fab9ede5700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fab9ede5700' had timed out after 0.000000000s

Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug 2022-08-04T06:59:29.808+0000 7fab9cde1700  1 mon.dcn-ceph-01@4(electing) e21 collect_metadata md0:  no unique device id for md0: fallback method has no model nor serial'
"

There are a lot of those 15s heartbeat timeouts.

I'm not sure what to look for. The cluster is quite busy, but OSDs shouldn't flap or crash like that, should they? Perhaps some heartbeat/timeout values could be tweaked?

Mvh.

Torkil

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



