On Tue, Aug 15, 2017 at 7:03 AM Andreas Calminder <andreas.calminder@xxxxxxxxxx> wrote:
Hi,
I got hit with OSD suicide timeouts while deep-scrub runs on a
specific PG. There's a RH article
(https://access.redhat.com/solutions/2127471) suggesting changing
osd_scrub_thread_suicide_timeout from 60s to a higher value. The problem
is that the article is for Hammer, and osd_scrub_thread_suicide_timeout
doesn't appear in the output of
ceph daemon osd.34 config show
and the default timeout (60s) suggested in the article doesn't really
match the suicide timeout in the logs:
2017-08-15 15:39:37.512216 7fb293137700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fb231adf700' had suicide timed out after 150
2017-08-15 15:39:37.518543 7fb293137700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(const
ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fb293137700
time 2017-08-15 15:39:37.512230
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
The suicide timeout (150) does match
osd_op_thread_suicide_timeout. However, when I try changing it I get:
ceph daemon osd.34 config set osd_op_thread_suicide_timeout 300
{
"success": "osd_op_thread_suicide_timeout = '300' (unchangeable) "
}
And the deep scrub still hits the suicide timeout after 150 seconds, just like before.
The cluster is left with osd.34 flapping. Is there any way to let the
deep-scrub finish and get out of the infinite deep-scrub loop?
You can set that option in ceph.conf and restart the OSD. It's reported as "unchangeable" because its value is used to initialize some other structures at boot, so you can't edit it live.
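As a rough sketch of what that could look like, assuming the value 300 from the attempt above and that the setting should apply to every OSD on the host (both are illustrative choices, not a recommendation):

```ini
# /etc/ceph/ceph.conf on the host running osd.34 (default path; adjust if yours differs)
[osd]
# Read once at OSD startup, so it only takes effect after the daemon restarts.
# 300 is just the value tried earlier in this thread; use a narrower
# [osd.34] section instead of [osd] to limit the change to the one OSD.
osd_op_thread_suicide_timeout = 300
```

After restarting the daemon (for example "systemctl restart ceph-osd@34", assuming a systemd-managed deployment), "ceph daemon osd.34 config get osd_op_thread_suicide_timeout" should report the new value.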
Regards,
Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com