Re: osd process threads stack up on osds failure

Kostis Fardelas <dante1234@xxxxxxxxx> · Mon, 7 Dec 2015 23:37:27 +0200

Hi Greg,
the node rebooted unexpectedly. According to the Ceph cluster logs, the
timeline was:
12:36:56 - 12:37:02 osds reported down
12:42:00 - 12:42:05 osds reported out
13:50:44 - 13:50:49 osds booted again

The thread count on all other OSD nodes ramped up from 12:36 until
approximately 14:00.
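
In case it helps to reproduce the numbers, here is a minimal sketch of
how per-OSD thread counts could be sampled on a Linux node. It assumes
the OSD daemons appear under the process name "ceph-osd" and reads the
"Threads:" field from /proc/<pid>/status. The function names are
hypothetical and this is not necessarily how the figures above were
collected.

```python
import os


def parse_status_field(status_text, field):
    """Return the raw value of `field` from /proc/<pid>/status text, or None."""
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def osd_thread_counts(proc_root="/proc", process_name="ceph-osd"):
    """Map pid -> thread count for each process whose Name equals process_name.

    Assumption: the OSD daemons run under the process name "ceph-osd".
    """
    counts = {}
    if not os.path.isdir(proc_root):
        return counts  # no procfs available at this path
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or is not readable
        if parse_status_field(text, "Name") == process_name:
            threads = parse_status_field(text, "Threads")
            if threads is not None:
                counts[int(entry)] = int(threads)
    return counts


if __name__ == "__main__":
    counts = osd_thread_counts()
    for pid, threads in sorted(counts.items()):
        print(f"pid {pid}: {threads} threads")
    print(f"total: {sum(counts.values())} threads "
          f"in {len(counts)} ceph-osd processes")
```

Run periodically (for example from cron) on each node, this would give
a per-process series that can be lined up against the down/out/boot
timestamps above.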

The cluster recovered at about 16:20. I have not restarted any OSDs
so far. Nothing else happened in the cluster in the meantime, and there
were no ERR/WRN entries in the cluster log.

Regards,
Kostis

On 7 December 2015 at 17:08, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Dec 7, 2015 at 6:59 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hi cephers,
>> after one OSD node crash (6 OSDs in total), we experienced an increase
>> of approximately 230-260 threads for every other OSD node. We have 26
>> OSD nodes with 6 OSDs per node, so this is approximately 40 threads
>> per osd. The OSD node has joined the cluster after 15-20 minutes.
>>
>> The only workaround I have found so far is to restart the OSDs of the
>> cluster, but this is quite a heavy operation. Could you help me
>> understand whether the behaviour described above is expected and
>> what could be causing it? Does Ceph clean up OSD process threads
>> properly?
>>
>> Extra info: all threads are in sleeping state right now and context
>> switches have been stabilized at the pre-crash levels
>
> Can you describe exactly what you observed, with time intervals? E.g.:
> did the OSDs get restarted after crashing, and how did the thread
> counts relate to that? Did anything else happen in the cluster while
> this was happening? How long did you wait before you began restarting
> OSDs to reduce the thread counts?
> -Greg