Re: crashing OSDs with FAILED ceph_assert

Hi Denis,

Please share the OSD log output preceding the assertion. It usually contains helpful information about the root cause, e.g. an error code.
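A minimal sketch of how those preceding lines could be pulled out of a log file. The function name, the default marker string, and the context size are illustrative assumptions and not part of any Ceph tooling; the marker is simply the text that appears in the assert message of the crash report.

```python
from collections import deque
from typing import Iterable, List


def lines_before_assert(
    lines: Iterable[str],
    marker: str = "FAILED ceph_assert",  # assumed marker, taken from the assert message
    context: int = 200,                  # arbitrary number of preceding lines to keep
) -> List[str]:
    """Return up to `context` lines preceding the first line that contains
    `marker`, followed by that line itself. Returns [] if no line matches."""
    window = deque(maxlen=context)  # sliding window over the most recent lines
    for line in lines:
        if marker in line:
            return list(window) + [line]
        window.append(line)
    return []
```

To use it, open the OSD's log file (its location depends on the deployment), pass the file object to `lines_before_assert`, and paste the returned lines into the reply.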


Thanks,

Igor

On 3/12/2022 5:01 PM, Denis Polom wrote:
Hi,

I have a Ceph cluster running Pacific 16.2.7 with an RBD pool; the OSDs are on SSDs with the DB on a separate NVMe.

I observe that OSDs are crashing randomly. The output of crash info is:


{
    "archived": "2022-03-12 11:44:37.251897",
    "assert_condition": "r == 0",
    "assert_file": "/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc",
    "assert_func": "virtual void KernelDevice::aio_submit(IOContext*)",
    "assert_line": 826,
    "assert_msg": "/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc: In function 'virtual void KernelDevice::aio_submit(IOContext*)' thread 7f0e5c84c700 time 2022-03-12T10:15:36.092656+0100\n/build/ceph-16.2.7/src/blk/kernel/KernelDevice.cc: 826: FAILED ceph_assert(r == 0)\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x128f0) [0x7f0e91d428f0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x55a60fc646ec]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a60fc64876]",
        "(KernelDevice::aio_submit(IOContext*)+0x70b) [0x55a61074f80b]",
        "(BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x45) [0x55a610208825]",
        "(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x5ec) [0x55a610208f5c]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x962) [0x55a61024c4b2]",
        "(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55a60fe783d4]",
        "(ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb18) [0x55a610078b58]",
        "(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1a7) [0x55a6100898a7]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55a60fec95d7]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6fd) [0x55a60fe662ad]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x55a60fceb52b]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x55a60ff4e84a]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd1e) [0x55a60fd0959e]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55a610387eac]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a61038b160]",
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0e91d376db]",
        "clone()"
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2022-03-12T09:15:36.167442Z_37b6fbae-de81-4d07-8cf8-d81eea5e150a",
    "entity_name": "osd.48",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "18.04.6 LTS (Bionic Beaver)",
    "os_version_id": "18.04",
    "process_name": "ceph-osd",
    "stack_sig": "1461d5ef23cacdccd0bb5b940315466572a2015d841e5517411623b9c5e5d357",
    "timestamp": "2022-03-12T09:15:36.167442Z",
    "utsname_hostname": "ceph4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-89-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#100~18.04.1-Ubuntu SMP Wed Sep 29 10:59:42 UTC 2021"
}
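
A report like the one above can be condensed into a single line for comparing crashes across OSDs. This is only a sketch: it assumes the report has been saved as JSON text (for example from `ceph crash info <crash_id>`), the function names are hypothetical, and the only fields it reads are those visible in the report above.

```python
import json
import posixpath


def summarize_crash(report: dict) -> str:
    """Build a one-line summary from a crash report dict.
    Missing fields are shown as '?' rather than raising an error."""
    # Keep only the file name; the build path prefix adds no information.
    source = posixpath.basename(report.get("assert_file", "?"))
    return "{} (ceph {}): ceph_assert({}) failed at {}:{} in thread {}".format(
        report.get("entity_name", "?"),
        report.get("ceph_version", "?"),
        report.get("assert_condition", "?"),
        source,
        report.get("assert_line", "?"),
        report.get("assert_thread_name", "?"),
    )


def summarize_crash_json(text: str) -> str:
    """Same as summarize_crash, but starting from the raw JSON text."""
    return summarize_crash(json.loads(text))
```

If several OSDs produce the same summary line (and the same `stack_sig`), they are most likely hitting the same assertion, which narrows down what to look for in the logs.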


I found some reported issues about this but no solution:

https://tracker.ceph.com/issues/20381 recommends increasing the aio max queue depth, whose default is 1024. I've increased the value to 4096, but the issue persists.

I checked the fragmentation rating on the OSDs and it looks pretty high:

"fragmentation_rating": 0.8454254911820045

My bluefs_alloc_size value is the default, 1 MB:

"bluefs_alloc_size": "1048576"


I need help here: would it be correct to adjust this value and/or some others? Or might the problem lie somewhere else?

Thank you!


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
