Hello
I have 3 OSDs that are stuck in a perpetual loop of
heartbeat_map is_healthy ... had timed out after 15.000000954s
<repeated many, many times>
heartbeat_map is_healthy ... had suicide timed out after
150.000000000s
*** Caught signal (Aborted) **
This began happening some time after I had moved a pool off
these OSDs. Now the pools that still use these 3 OSDs are in
trouble and I don't know how to resolve this situation. I am
running Ceph 16.2.7 (Pacific).
Can anybody help?
Not sure if it's relevant, but these OSDs have a custom
device class "nvme", so there is this line in the logs:
7fe6308fc080 -1 osd.74 15842 mon_cmd_maybe_osd_create fail:
'osd.74 has already bound to class 'nvme', can not reset class
to 'ssd'; use 'ceph osd crush rm-device-class <id>' to remove
old class first': (16) Device or resource busy
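
In case it matters, this is how I understand the device class can be
checked and, if it ever needed to be re-bound, changed, based on the
hint in the error message above (osd.74 is just the example from the
log):

# list the OSDs currently bound to the custom class
ceph osd crush class ls-osd nvme

# the error says the old class has to be removed before it can be
# changed, e.g. for osd.74:
ceph osd crush rm-device-class osd.74
ceph osd crush set-device-class nvme osd.74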
I tried to set the following in ceph.conf, but it didn't
seem to make a difference.
[osd]
osd_max_scrubs = 0
osd_heartbeat_grace = 200
osd_scrub_thread_suicide_timeout = 600
osd_op_thread_suicide_timeout = 1500
osd_command_thread_suicide_timeout = 9000
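
These are the commands I believe can be used to check whether those
values actually reached the running daemons, and to set them through
the monitors instead of ceph.conf (osd.74 again as the example):

# show the effective configuration of one OSD
ceph config show osd.74 | grep -E 'suicide|heartbeat_grace|max_scrubs'

# or query the admin socket on the host running the OSD
ceph daemon osd.74 config get osd_heartbeat_grace

# the same options can also be set centrally
ceph config set osd osd_heartbeat_grace 200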
Thanks,
Vlad