Re: Jewel (10.2.7) osd suicide timeout while deep-scrub

On Thu, Aug 17, 2017 at 12:14 AM Andreas Calminder <andreas.calminder@xxxxxxxxxx> wrote:
Thanks,
I've modified the timeout successfully; unfortunately it wasn't enough
for the deep-scrub to finish, so I increased
osd_op_thread_suicide_timeout even further (1200s). The deep-scrub
command still gets killed before that timeout is reached, however. I
figured the relevant setting was osd_command_thread_suicide_timeout,
adjusted it accordingly and restarted the OSD, but it still got killed
approximately 900s after starting.
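For reference, a minimal sketch of how these suicide timeouts might be
raised persistently, assuming a Jewel-style ceph.conf and a hypothetical
OSD id of 12 (the 1200s values simply mirror the ones mentioned above):

    # /etc/ceph/ceph.conf on the OSD host (illustrative values)
    [osd]
    osd op thread suicide timeout = 1200
    osd command thread suicide timeout = 1200

    # restart the affected daemon (unit name depends on the deployment)
    systemctl restart ceph-osd@12

    # confirm the value the running daemon actually picked up
    ceph daemon osd.12 config get osd_op_thread_suicide_timeout

Restarting the daemon is what makes the new thread-pool timeouts take
effect; the last command only checks what the running OSD sees.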

The log spits out:
2017-08-17 09:01:35.723865 7f062e696700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15
2017-08-17 09:01:40.723945 7f062e696700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15
2017-08-17 09:01:45.012105 7f05cceee700  1 heartbeat_map reset_timeout
'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15

I'm thinking that having an OSD in the cluster locked up for ~900s
maybe isn't the best thing. Is there any way of doing this deep-scrub
operation "offline", or in some way that won't affect or get affected
by the rest of the cluster?

Deep scrub actually timing out a thread is pretty weird anyway — I think it requires some combination of abnormally large objects/omap indexes and buggy releases.

Is there any more information in the log about the thread that's timing out? What's leading you to believe it's the deep scrub? What kind of data is in the pool?
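If it helps with gathering that information, here's a rough sketch of
commands that could show what the stuck thread is actually doing;
osd.12 is a hypothetical id and the debug level is only illustrative:

    # temporarily raise OSD logging on the suspect daemon
    ceph tell osd.12 injectargs '--debug_osd 10'

    # list the operations the OSD's op tracker currently has in flight
    ceph daemon osd.12 dump_ops_in_flight

    # see which PGs report a (deep-)scrubbing state right now
    ceph pg dump pgs_brief | grep -i scrub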
