Re: osd process threads stack up on osds failure

Samuel Just <sjust@xxxxxxxxxx> · Thu, 14 Jan 2016 11:11:42 -0800

Probably worth filing a bug.  Make sure to include the usual stuff:
1) version
2) logs from a crashing osd

For this one, it would also be handy if you used gdb to dump the
thread backtraces for an osd which is experiencing "an increase of
approximately 230-260 threads for every other OSD node".
-Sam
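
A minimal sketch of the gdb step suggested above, assuming gdb is installed on the OSD node, the script runs with enough privileges to attach to the process, and the PID of the affected ceph-osd process is already known. The function names and the output file name are hypothetical, chosen only for illustration. Note that attaching gdb pauses the process while the backtraces are collected.

```python
import subprocess
from pathlib import Path


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, prints a
    backtrace for every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid, out_dir="."):
    """Run gdb against `pid` and save the thread backtraces to a text file.

    The file name pattern is an assumption made for this sketch.
    Returns the path of the file that was written.
    """
    out_path = Path(out_dir) / f"osd-threads-{pid}.txt"
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # keep whatever gdb printed even if it exits non-zero
    )
    out_path.write_text(result.stdout)
    return out_path
```

Calling dump_backtraces(<pid>) for one of the OSDs showing the extra threads produces a file that can be attached to the bug report together with the version and the logs.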

On Mon, Dec 7, 2015 at 1:37 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Hi Greg,
> the node rebooted unexpectedly. According to the ceph cluster logs, the
> timeline goes like this:
> 12:36:56 - 12:37:02 osds reported down
> 12:42:00 - 12:42:05 osds reported out
> 13:50:44 - 13:50:49 osds booted again
>
> The thread count on all other OSD nodes was ramping up from 12:36
> until approximately 14:00.
>
> The cluster recovered at about 16:20. I have not restarted any OSD
> so far. Nothing else happened in the cluster in the meantime. There
> was no ERR/WRN in the cluster log.
>
> Regards,
> Kostis
>
> On 7 December 2015 at 17:08, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Mon, Dec 7, 2015 at 6:59 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>> Hi cephers,
>>> after one OSD node crashed (6 OSDs in total), we experienced an increase
>>> of approximately 230-260 threads for every other OSD node. We have 26
>>> OSD nodes with 6 OSDs per node, so this is approximately 40 threads
>>> per osd. The OSD node rejoined the cluster after 15-20 minutes.
>>>
>>> The only workaround I have found so far is to restart the OSDs of the
>>> cluster, but this is quite a heavy operation. Could you help me
>>> understand whether the behaviour described above is expected and
>>> what could be the reason for it? Does ceph clean up osd process
>>> threads appropriately?
>>>
>>> Extra info: all threads are in the sleeping state right now, and context
>>> switches have stabilized at the pre-crash levels.
>>
>> Can you describe exactly what you observed, with time intervals? E.g.,
>> did the OSDs get restarted after crashing, and how did the thread
>> counts relate to that? Did anything else happen in the cluster while
>> this was happening? How long did you wait before you began restarting
>> OSDs to reduce the thread counts?
>> -Greg
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com