Hi
We have a lot of OSDs flapping during recovery, and eventually they don't
come up again until kicked with "ceph orch daemon restart osd.x".
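For reference, this is roughly what I run to kick them all when several
are stuck down at once. It assumes jq is installed, and the jq filter on
the osd tree JSON is my own guess at the field names:
"
#!/bin/bash
# Restart every OSD the cluster currently sees as down, via cephadm.
# "ceph osd tree down" limits the tree to subtrees with down OSDs;
# the jq filter then picks out the osd entries themselves.
for id in $(ceph osd tree down -f json | jq -r '.nodes[] | select(.type=="osd") | .id'); do
    echo "restarting osd.${id}"
    ceph orch daemon restart "osd.${id}"
done
"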
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
6 hosts connected by 2 x 10 Gbit links. Most data is in an EC 2+2 rbd pool.
"
# ceph -s
  cluster:
    id:     3b7736c6-00e4-11ec-a3c5-3cecef467984
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Degraded data redundancy: 95909/1023542135 objects degraded (0.009%), 10 pgs degraded, 6 pgs undersized
            484 pgs not deep-scrubbed in time
            725 pgs not scrubbed in time
            11 daemons have recently crashed
            4 slow ops, oldest one blocked for 178 sec, daemons [osd.13,osd.15,osd.19,osd.46,osd.50] have slow ops.

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 51s)
    mgr:        dcn-ceph-01.dzercj(active, since 20h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 3m), 118 in (since 86m); 137 remapped pgs
                flags noscrub,nodeep-scrub,nosnaptrim
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 257.30M objects, 334 TiB
    usage:   673 TiB used, 680 TiB / 1.3 PiB avail
    pgs:     95909/1023542135 objects degraded (0.009%)
             7498287/1023542135 objects misplaced (0.733%)
             2505 active+clean
             98   active+remapped+backfilling
             85   active+clean+snaptrim_wait
             32   active+remapped+backfill_wait
             6    active+clean+laggy
             5    active+undersized+degraded+remapped+backfilling
             4    active+recovering+degraded
             1    active+recovering+degraded+remapped
             1    active+undersized+remapped+backfilling

  io:
    client:   45 KiB/s rd, 1.3 MiB/s wr, 52 op/s rd, 91 op/s wr
    recovery: 1.2 GiB/s, 467 objects/s

  progress:
    Global Recovery Event (20h)
      [==========================..] (remaining: 65m)
"
Crash info for one OSD:
"
2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
in thread 7fdedd02f700 thread_name:tp_osd_tp
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
(stable)
1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
2: pread64()
3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x40d) [0x55701b4c0f0d]
4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
long, char*)+0x60d) [0x55701b05ee7d]
5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long,
unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*,
rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned
long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
9: (rocksdb::Status
rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919)
[0x55701b691ba9]
10: (rocksdb::Status
rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
rocksdb::UncompressionDict const&,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*,
rocksdb::BlockType, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286) [0x55701b691f86]
11:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable
const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1)
[0x55701b760891]
12:
(rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool,
rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*,
rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe)
[0x55701b760b5e]
13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&,
bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*)
const+0x43) [0x55701b69aae3]
14:
(rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions
const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool,
rocksdb::SliceTransform const*, rocksdb::GetContext*,
rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, bool)+0x180) [0x55701b6829a0]
16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&,
rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&,
rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform
const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
17: (rocksdb::Version::Get(rocksdb::ReadOptions const&,
rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*,
rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*,
rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&,
rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537)
[0x55701b50fbe7]
19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&,
rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
20: (RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b)
[0x55701b4d617b]
21:
(BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7)
[0x55701af50b57]
22: (BlueStore::_wctx_finish(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0xeaf) [0x55701af908cf]
23: (BlueStore::_do_truncate(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>, unsigned long,
std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
std::allocator<BlueStore::SharedBlob*> >*)+0x3a8) [0x55701af91f68]
24: (BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
25: (BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr<BlueStore::Collection>&,
boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
27:
(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction>
>&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316)
[0x55701afba1a6]
28:
(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>,
ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap
const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
30: (OSD::dequeue_peering_evt(OSDShard*, PG*,
std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8)
[0x55701aa9ee98]
31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
32: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55701b1025b4]
34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
36: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
"
Some errors seen in the journal on the host:
"
Aug 04 06:57:59 dcn-ceph-01 bash[5683]: debug
2022-08-04T06:57:59.735+0000 7fa687612700 -1
librbd::SnapshotRemoveRequest: 0x559541832780 handle_trash_snap: failed
to move snapshot to trash: (16) Device or resource busy
Aug 04 06:57:59 dcn-ceph-01 bash[7700]: debug
2022-08-04T06:57:59.827+0000 7f6f7695e700 -1 osd.25 138094
heartbeat_check: no reply from 172.21.15.55:6844 osd.100 since back
2022-08-04T06:57:22.475054+0000 front 2022-08-04T06:57:22.475002+0000
(oldest deadline 2022-08-04T06:57:46.573685+0000)
Aug 04 06:59:34 dcn-ceph-01 bash[8385]: debug
2022-08-04T06:59:34.398+0000 7f4b440c4700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f4b25676700' had timed out after 15.000000954s
Aug 04 06:59:31 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:31.619+0000 7fab9ede5700 1 heartbeat_map reset_timeout
'Monitor::cpu_tp thread 0x7fab9ede5700' had timed out after 0.000000000s
Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug
2022-08-04T06:59:29.808+0000 7fab9cde1700 1 mon.dcn-ceph-01@4(electing)
e21 collect_metadata md0: no unique device id for md0: fallback method
has no model nor serial'
"
There are a lot of those 15s heartbeat timeouts.
I'm not sure what to look for. The cluster is quite busy, but OSDs
shouldn't flap or crash like that, should they? Perhaps some
heartbeat/timeout values could be tweaked, something like the sketch below?
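This is what I had in mind, if it makes sense at all. The values are just
guesses on my part, not recommendations; as far as I know
osd_op_thread_timeout defaults to 15 s (which matches the messages above)
and osd_heartbeat_grace to 20 s:
"
# give the op threads and heartbeat peers a bit more slack
ceph config set osd osd_op_thread_timeout 30
ceph config set osd osd_heartbeat_grace 30
# or slow recovery down instead, so the OSDs stop starving heartbeats
ceph config set osd osd_recovery_sleep_hdd 0.2
"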
Best regards,
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx