Hi,

One of our 16.2.14 cluster OSDs crashed again because of the dreaded https://tracker.ceph.com/issues/53906 bug. Usually an OSD that crashes because of this bug restarts within seconds and continues normal operation. This time it failed to restart and kept crashing:

    "assert_condition": "abort",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc",
    "assert_func": "void KernelDevice::_aio_thread()",
    "assert_line": 604,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f08520e2700 time 2023-12-03T04:00:36.689614+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg(\"Unexpected IO error. This may suggest HW issue. Please check your dmesg!\")\n",
    "assert_thread_name": "bstore_aio",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f085e308cf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55f01d9494cb]",
        "(KernelDevice::_aio_thread()+0x1285) [0x55f01e4b5c15]",
        "(KernelDevice::AioCompletionThread::entry()+0x11) [0x55f01e4c0ee1]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f085e2fe1ca]",
        "clone()"
    ],

There was nothing in dmesg, though, and the block device looked healthy. I took the OSD down, ran a long SMART test and a read test on its block drive, and found no issues.
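
For anyone comparing crash reports like the one above, the key fields can be pulled out with a few lines of Python. This is only a minimal sketch under stated assumptions: the crash metadata is available as JSON (for example as printed by `ceph crash info <id>`), the sample is a trimmed copy of the excerpt quoted above with the long build path shortened, and `summarize_crash` is a hypothetical helper name, not part of Ceph.

```python
import json
import re

# Trimmed copy of the crash excerpt quoted above. The build path in
# assert_msg is shortened here for readability (an edit for illustration).
CRASH_JSON = r'''
{
  "assert_condition": "abort",
  "assert_func": "void KernelDevice::_aio_thread()",
  "assert_line": 604,
  "assert_msg": "ceph-16.2.14/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg(\"Unexpected IO error. This may suggest HW issue. Please check your dmesg!\")\n",
  "assert_thread_name": "bstore_aio"
}
'''


def summarize_crash(meta: dict) -> dict:
    """Reduce crash metadata to the fields useful for comparing reports.

    Hypothetical helper: it only reads keys that appear in the excerpt above.
    """
    msg = meta.get("assert_msg", "")
    # Extract the text passed to ceph_abort_msg("..."), if present.
    match = re.search(r'ceph_abort_msg\("(.*?)"\)', msg)
    return {
        "func": meta.get("assert_func"),
        "line": meta.get("assert_line"),
        "thread": meta.get("assert_thread_name"),
        "abort_msg": match.group(1) if match else None,
    }


if __name__ == "__main__":
    summary = summarize_crash(json.loads(CRASH_JSON))
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Grouping several crash entries by `func`, `line` and `abort_msg` makes it easier to tell whether repeated crashes share the same abort site.
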

I tried restarting the OSD again and found in its debug log that it failed because of the following error:

    2023-12-03T04:00:36.686+0000 7f08520e2700 -1 bdev(0x55f02a28a400 /var/lib/ceph/osd/ceph-56/block) _aio_thread got r=-1 ((1) Operation not permitted)

https://pastebin.com/gDat6rfk

I remember hitting this previously: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/GYL72G3F4PPCSWG5STQ7WLUXTNNI676S/. This time a host reboot completely resolved the issue.

It would be good to understand what triggered this condition and how it can be resolved without rebooting the whole host. I would very much appreciate any suggestions.

Best regards,
Zakhar