Re: Jewel (10.2.7) osd suicide timeout while deep-scrub

Andreas Calminder <andreas.calminder@xxxxxxxxxx> · Tue, 5 Sep 2017 07:17:32 +0200

Hi!
Thanks for the pointer about leveldb_compact_on_mount. It took a while
to get everything compacted, but after that the deep scrub of the
offending pg went smoothly without any suicides. I'm considering using
the compact on mount feature for all the OSDs in our cluster, since
they're fairly large and therefore rather slow (SAS, but still).
Anyhow, thanks a lot for the help!

/andreas

On 17 August 2017 at 23:48, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Thu, Aug 17, 2017 at 1:02 PM, Andreas Calminder
> <andreas.calminder@xxxxxxxxxx> wrote:
>> Hi!
>> Thanks for getting back to me!
>>
>> Clients access the cluster through rgw (s3), and we had some big buckets
>> containing a lot of small files. Before this happened I removed a
>> semi-stale bucket with a rather large index (2.5 million objects). All but
>> 30 of those objects didn't actually exist, which made the normal
>> radosgw-admin bucket rm command fail, so I had to remove the bucket
>> instances and bucket metadata by hand, leaving the remaining 30 objects
>> floating around in the cluster.
>>
>> I don't have access to the logs at the moment, but I see the deep-scrub
>> starting in the log for osd.34, and after a while it starts logging
>>
>> 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread $THREADID' had timed out after 15
>>
>> $THREADID seems to be the same thread as the deep scrub. After a while
>> the OSD hits the suicide timeout, a lot of operations happen, and then the
>> deep scrub tries again on the same pg and the above repeats.
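The two stages described above (a logged "had timed out after 15" warning, then a suicide much later) can be illustrated with a small model. This is a simplified sketch, not Ceph's code: it assumes, as far as I know for this release, that each op thread has a grace period (osd_op_thread_timeout, the 15 s in the log) after which is_healthy only warns, and a separate suicide grace (osd_op_thread_suicide_timeout) after which the OSD aborts. The class name WorkerHeartbeat is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkerHeartbeat:
    """Hypothetical, simplified model of one thread's heartbeat entry."""
    grace: float          # seconds before is_healthy logs "had timed out after <grace>"
    suicide_grace: float  # seconds before the OSD aborts itself; 0 disables it

    def state(self, seconds_since_reset: float) -> str:
        # Past the suicide grace: the daemon would assert and die.
        if self.suicide_grace and seconds_since_reset > self.suicide_grace:
            return "suicide"
        # Past the normal grace: only a warning in the log; the thread keeps running.
        if seconds_since_reset > self.grace:
            return "timed out"
        return "healthy"


if __name__ == "__main__":
    # Assumed values: 15 s grace as in the log, 1200 s suicide grace as tried later in the thread.
    op_thread = WorkerHeartbeat(grace=15, suicide_grace=1200)
    for elapsed in (10, 60, 900, 1300):
        print(elapsed, op_thread.state(elapsed))
```

In this model a thread stuck on a long deep-scrub step only produces warnings until the suicide grace is exceeded, which is why raising that one value delays (but does not prevent) the abort.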
>>
>> The osd disk (we have 1 osd per disk) is rather large and pretty slow, so
>> that might be the cause, but then I'd expect to see the same behaviour
>> elsewhere in the cluster as well, since all osd disks are of the same type
>> and size.
>>
>> One thought I had is to just kill the disk and re-add it, since the data is
>> supposed to be replicated to 3 nodes in the cluster, but I'd rather find
>> out what happened and get it fixed.
>
> Ah. Some people have also found that compacting the leveldb store
> improves the situation a great deal. In most versions you can do this
> by setting "leveldb_compact_on_mount = true" in the OSD's config file
> and then restarting the daemon. You may also have admin socket
> commands available to trigger it.
>
> I'd try out those and then turn it on again with the high suicide
> timeout and see if things improve.
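As a sketch of the config change described above, assuming the usual INI-style ceph.conf with an [osd] section: the snippet uses Python's standard configparser to set leveldb_compact_on_mount for all OSDs and, optionally, a raised osd_op_thread_suicide_timeout. The sample file content and the function name enable_compact_on_mount are hypothetical; 1200 is the value tried elsewhere in this thread, not a recommendation.

```python
import configparser
import io
from typing import Optional

# Hypothetical starting ceph.conf; a real one will have more sections and keys.
SAMPLE_CONF = """\
[global]
fsid = 00000000-0000-0000-0000-000000000000

[osd]
osd journal size = 10240
"""


def enable_compact_on_mount(conf_text: str, suicide_timeout: Optional[int] = None) -> str:
    """Return conf_text with leveldb_compact_on_mount enabled for every OSD."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("osd"):
        parser.add_section("osd")
    # Compacts each OSD's leveldb store the next time the daemon starts.
    parser.set("osd", "leveldb_compact_on_mount", "true")
    if suicide_timeout is not None:
        # Temporarily raised op-thread suicide timeout while the deep scrub runs.
        parser.set("osd", "osd_op_thread_suicide_timeout", str(suicide_timeout))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(enable_compact_on_mount(SAMPLE_CONF, suicide_timeout=1200))
```

configparser drops comments when it rewrites a file, so for a real ceph.conf a hand edit may be preferable. Either way the OSD daemons still need a restart for compact on mount to take effect.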
> -Greg
>
>
>>
>> /andreas
>>
>>
>> On 17 Aug 2017 20:21, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Thu, Aug 17, 2017 at 12:14 AM Andreas Calminder
>> <andreas.calminder@xxxxxxxxxx> wrote:
>>>
>>> Thanks,
>>> I've modified the timeout successfully; unfortunately it wasn't enough
>>> for the deep-scrub to finish, so I increased
>>> osd_op_thread_suicide_timeout even further (1200s). However, the
>>> deep-scrub still gets killed before this timeout is reached. I
>>> figured it was osd_command_thread_suicide_timeout, adjusted it
>>> accordingly and restarted the osd, but it still got killed
>>> approximately 900s after starting.
>>>
>>> The log spits out:
>>> 2017-08-17 09:01:35.723865 7f062e696700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15
>>> 2017-08-17 09:01:40.723945 7f062e696700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15
>>> 2017-08-17 09:01:45.012105 7f05cceee700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15
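To check which thread is stalling and for how long, the heartbeat warnings can be grouped per thread id. This is a minimal sketch assuming unwrapped log entries (one per line) in the format quoted above; the regex and the function name stalled_threads are hypothetical and may not match other Ceph versions' log formats.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed format: one unwrapped OSD log entry per line, as in the excerpt above.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<logger>[0-9a-f]+)\s+\d+ heartbeat_map (?P<check>\w+) "
    r"'(?P<pool>[^']+) thread (?P<tid>0x[0-9a-f]+)' had timed out after (?P<grace>\d+)"
)

LOG = [
    "2017-08-17 09:01:35.723865 7f062e696700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15",
    "2017-08-17 09:01:40.723945 7f062e696700  1 heartbeat_map is_healthy "
    "'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15",
    "2017-08-17 09:01:45.012105 7f05cceee700  1 heartbeat_map reset_timeout "
    "'OSD::osd_op_tp thread 0x7f05cceee700' had timed out after 15",
]


def stalled_threads(lines):
    """Group heartbeat timeout warnings by (thread pool, stalled thread id)."""
    stats = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        m = HEARTBEAT_RE.match(line)
        if not m:
            continue  # not a heartbeat timeout line
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        entry = stats[(m["pool"], m["tid"])]
        entry["count"] += 1
        entry["first"] = entry["first"] or ts
        entry["last"] = ts
    return dict(stats)


if __name__ == "__main__":
    for (pool, tid), s in stalled_threads(LOG).items():
        span = (s["last"] - s["first"]).total_seconds()
        print(f"{pool} {tid}: {s['count']} warnings over {span:.1f}s")
        # prints "OSD::osd_op_tp 0x7f05cceee700: 3 warnings over 9.3s"
```

The resulting thread id can then be searched for in the surrounding log lines to see which operation (for example a deep scrub) that thread was running when it stalled.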
>>>
>>> I'm thinking having an osd in the cluster locked up for ~900s probably
>>> isn't the best thing. Is there any way of running this deep-scrub
>>> "offline", or in some way that won't affect or be affected by the rest
>>> of the cluster?
>>
>>
>> Deep scrub actually timing out a thread is pretty weird anyway; I think it
>> requires some combination of abnormally large objects/omap indexes and buggy
>> releases.
>>
>> Is there any more information in the log about the thread that's timing out?
>> What's leading you to believe it's the deep scrub? What kind of data is in
>> the pool?
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com