Hi Igor
I would say almost all OSDs, and it happens randomly.
There isn't any error message in the kernel log, and smartctl shows the
disks as healthy.
# smartctl -l error /dev/sdh
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.0-89-generic] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
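(For reference, the kernel log was checked with something along these
lines; the exact filter is only an illustration:)
# dmesg -T | grep -iE 'error|fail|i/o|sd[a-z]|nvme'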
Thx!
On 3/12/22 17:06, Igor Fedotov wrote:
Denis,
maybe there is something interesting in the dmesg or smartctl output?
Are all OSDs/nodes in the cluster affected?
When did that start to happen? How often?
Thanks,
Igor
On 3/12/2022 6:14 PM, Denis Polom wrote:
Hi Igor,
before the assertion there is:
2022-03-12T10:15:35.879+0100 7f0e61055700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 5
2022-03-12T10:15:35.883+0100 7f0e6d06d700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 2
2022-03-12T10:15:35.951+0100 7f0e61856700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 7
2022-03-12T10:15:35.967+0100 7f0e5a047700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 110
2022-03-12T10:15:36.015+0100 7f0e5a848700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 34
2022-03-12T10:15:36.031+0100 7f0e6605f700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 15
2022-03-12T10:15:36.087+0100 7f0e5c84c700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio_submit retries 43
2022-03-12T10:15:36.087+0100 7f0e5c84c700 -1 bdev(0x55a61c6a6000
/var/lib/ceph/osd/ceph-48/block) aio submit got (11) Resource
temporarily unavailable
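(Side note: the (11) is EAGAIN from io_submit(), i.e. the AIO queue
could not accept more requests at that moment. One system-wide limit
that is sometimes involved can be checked like this, purely as an
illustration:)
# sysctl fs.aio-nr fs.aio-max-nr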
let me know if you need more.
Thank you!
On 3/12/22 15:14, Igor Fedotov wrote:
Hi Denis,
please share OSD log output preceding the assertion. It usually has
some helpful information, e.g. error code, about the root cause.
Thanks,
Igor
On 3/12/2022 5:01 PM, Denis Polom wrote:
Hi,
I have a Ceph cluster running Pacific 16.2.7 with an RBD pool and OSDs
created on SSDs, with the DB on a separate NVMe.
What I observe is that OSDs are crashing randomly. The crash info output is:
{
"archived": "2022-03-12 11:44:37.251897",
"assert_condition": "r == 0",
"assert_file":
"/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc",
"assert_func": "virtual void
KernelDevice::aio_submit(IOContext*)",
"assert_line": 826,
"assert_msg":
"/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc: In function
'virtual void KernelDevice::aio_submit(IOContext*)' thread
7f0e5c84c700 time
2022-03-12T10:15:36.092656+0100\n/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc:
826: FAILED ceph_assert(r == 0)\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x128f0)
[0x7f0e91d428f0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x19c) [0x55a60fc646ec]",
"(ceph::__ceph_assertf_fail(char const*, char const*, int,
char const*, char const*, ...)+0) [0x55a60fc64876]",
"(KernelDevice::aio_submit(IOContext*)+0x70b)
[0x55a61074f80b]",
"(BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x45)
[0x55a610208825]",
"(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x5ec)
[0x55a610208f5c]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
std::vector<ceph::os::Transaction, std::allo
cator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>,
ThreadPool::TPHandle*)+0x962) [0x55a61024c4b2]",
"(non-virtual thunk to
PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction,
std::allocator<ceph::os::Transaction> >&,
boost::intrusive_ptr<OpRequest>)+0x54) [0x55a60fe783d4]",
"(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb18)
[0x55a610078b58]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1a7)
[0x55a6100898a7]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97)
[0x55a60fec95d7]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x6fd) [0x55a60fe662ad]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b)
[0x55a60fceb52b]",
"(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a)
[0x55a60ff4e84a]",
"(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xd1e) [0x55a60fd0959e]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac)
[0x55a610387eac]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x10)
[0x55a61038b160]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)
[0x7f0e91d376db]",
"clone()"
],
"ceph_version": "16.2.7",
"crash_id":
"2022-03-12T09:15:36.167442Z_37b6fbae-de81-4d07-8cf8-d81eea5e150a",
"entity_name": "osd.48",
"os_id": "ubuntu",
"os_name": "Ubuntu",
"os_version": "18.04.6 LTS (Bionic Beaver)",
"os_version_id": "18.04",
"process_name": "ceph-osd",
"stack_sig":
"1461d5ef23cacdccd0bb5b940315466572a2015d841e5517411623b9c5e5d357",
"timestamp": "2022-03-12T09:15:36.167442Z",
"utsname_hostname": "ceph4",
"utsname_machine": "x86_64",
"utsname_release": "5.4.0-89-generic",
"utsname_sysname": "Linux",
"utsname_version": "#100~18.04.1-Ubuntu SMP Wed Sep 29 10:59:42
UTC 2021"
}
I found some reported issues about this but didn't find any solution:
https://tracker.ceph.com/issues/20381 recommends increasing the aio max
queue depth, whose default is 1024. I've increased the value to 4096,
but the issue persists.
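(Assuming the option meant here is bdev_aio_max_queue_depth, the change
was made roughly like this; an OSD restart is needed for it to take
effect. This is just a sketch, not necessarily the exact commands used:)
# ceph config set osd bdev_aio_max_queue_depth 4096
# ceph config get osd.48 bdev_aio_max_queue_depth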
I checked the fragmentation rating on the OSDs and it looks pretty high:
"fragmentation_rating": 0.8454254911820045
My bluefs_alloc_size is at its default of 1 MB:
"bluefs_alloc_size": "1048576"
I need some help here: would it be correct to adjust this value and/or
some others? Or could the problem be somewhere else?
Thank you!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx