Hi Greg,
the node rebooted unexpectedly. According to the ceph cluster logs, the
timeline is as follows:

12:36:56 - 12:37:02  osds reported down
12:42:00 - 12:42:05  osds reported out
13:50:44 - 13:50:49  osds booted again

The thread count on all other OSD nodes was ramping up from 12:36 until
approximately 14:00. The cluster recovered at about 16:20. I have not
restarted any OSD so far. Nothing else happened in the cluster in the
meantime. There was no ERR/WRN in the cluster's log.

Regards,
Kostis

On 7 December 2015 at 17:08, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Dec 7, 2015 at 6:59 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>> Hi cephers,
>> after one OSD node crash (6 OSDs in total), we experienced an increase
>> of approximately 230-260 threads for every other OSD node. We have 26
>> OSD nodes with 6 OSDs per node, so this is approximately 40 threads
>> per osd. The OSD node has joined the cluster after 15-20 minutes.
>>
>> The only workaround I have found so far is to restart the OSDs of the
>> cluster, but this is a quite heavy operation. Could you help me
>> understand if the behaviour described above is an expected one and
>> what could be the reason for this? Does ceph cleanup appropriately osd
>> processes threads?
>>
>> Extra info: all threads are in sleeping state right now and context
>> switches have been stabilized at the pre-crash levels
>
> Can you describe exactly what you observed with time intervals? Eg:
> did the OSDs get restarted after crashing, and how did the thread
> counts relate to that. Did anything else happen in the cluster while
> this was happening. How long did you wait before you began restarting
> OSDs to reduce the thread counts.
> -Greg