Ceph 16.2.14: OSD crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

Hi,

One of the OSDs in our 16.2.14 cluster crashed again because of the dreaded
https://tracker.ceph.com/issues/53906 bug. Usually an OSD that crashes
because of this bug restarts within seconds and continues normal
operation. This time it failed to restart and kept crashing:

    "assert_condition": "abort",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc",
    "assert_func": "void KernelDevice::_aio_thread()",
    "assert_line": 604,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f08520e2700 time 2023-12-03T04:00:36.689614+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.14/rpm/el8/BUILD/ceph-16.2.14/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg(\"Unexpected IO error. This may suggest HW issue. Please check your dmesg!\")\n",
    "assert_thread_name": "bstore_aio",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f085e308cf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55f01d9494cb]",
        "(KernelDevice::_aio_thread()+0x1285) [0x55f01e4b5c15]",
        "(KernelDevice::AioCompletionThread::entry()+0x11) [0x55f01e4c0ee1]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f085e2fe1ca]",
        "clone()"
    ],

There was nothing in dmesg, though, and the block device looked healthy. I
took the OSD down, ran a long SMART test and a read test on its block
drive, and found no issues. I tried restarting the OSD again and found in
its debug log that it failed with this error:
"2023-12-03T04:00:36.686+0000 7f08520e2700 -1 bdev(0x55f02a28a400
/var/lib/ceph/osd/ceph-56/block) _aio_thread got r=-1 ((1) Operation not
permitted)". Full log: https://pastebin.com/gDat6rfk
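
For anyone who wants to scan OSD logs for the same signature, here is a
minimal Python sketch that pulls the device path and the error code out of a
line like the one quoted above and maps the code to its errno name. The
regular expression is inferred from that single sample line, and the names
AIO_ERR and parse_aio_error are made up for this illustration, so treat it as
a starting point rather than a reference parser for Ceph logs.

```python
import errno
import os
import re

# Layout inferred from one sample log line; other Ceph releases may differ.
AIO_ERR = re.compile(
    r"bdev\((?P<handle>0x[0-9a-f]+)\s+(?P<path>\S+)\)\s+"
    r"_aio_thread got r=(?P<r>-?\d+)"
)


def parse_aio_error(line):
    """Return device path and decoded errno for a bdev _aio_thread error line.

    Returns None when the line does not match the expected layout.
    """
    m = AIO_ERR.search(line)
    if m is None:
        return None
    code = -int(m.group("r"))  # the log reports the negated errno
    return {
        "path": m.group("path"),
        "errno": code,
        "name": errno.errorcode.get(code, "UNKNOWN"),
        "message": os.strerror(code),
    }


# The log line from this report, with the mail wrapping removed.
sample = (
    "2023-12-03T04:00:36.686+0000 7f08520e2700 -1 bdev(0x55f02a28a400 "
    "/var/lib/ceph/osd/ceph-56/block) _aio_thread got r=-1 "
    "((1) Operation not permitted)"
)

if __name__ == "__main__":
    print(parse_aio_error(sample))
```

On Linux, r=-1 decodes to EPERM, which matches the "Operation not permitted"
text in the log. That code is a permission error rather than the EIO a
failing drive would more typically produce.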

I remember hitting this previously:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/GYL72G3F4PPCSWG5STQ7WLUXTNNI676S/
This time, a host reboot completely resolved the issue.

It would be good to understand what triggered this condition and how it
can be resolved without rebooting the whole host. I would very much
appreciate any suggestions.

Best regards,
Zakhar