Probably worth filing a bug. Make sure to include the usual stuff:
1) version
2) logs from a crashing osd

For this one, it would also be handy if you used gdb to dump the thread
backtraces for an osd which is experiencing "an increase of
approximately 230-260 threads for every other OSD node"
-Sam

On Mon, Dec 7, 2015 at 1:37 PM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
> Hi Greg,
> the node rebooted unexpectedly. The timeline goes like this according to
> ceph cluster logs:
> 12:36:56 - 12:37:02 osds reported down
> 12:42:00 - 12:42:05 osds reported out
> 13:50:44 - 13:50:49 osds booted again
>
> The thread count in all other OSD nodes was ramping up from 12:36
> until approx. 14:00
>
> The cluster recovered at about 16:20. I have not restarted any OSD
> till now. Nothing else happened in the cluster in the meantime. There
> was no ERR/WRN in the cluster's log.
>
> Regards,
> Kostis
>
> On 7 December 2015 at 17:08, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Mon, Dec 7, 2015 at 6:59 AM, Kostis Fardelas <dante1234@xxxxxxxxx> wrote:
>>> Hi cephers,
>>> after one OSD node crash (6 OSDs in total), we experienced an increase
>>> of approximately 230-260 threads for every other OSD node. We have 26
>>> OSD nodes with 6 OSDs per node, so this is approximately 40 threads
>>> per osd. The OSD node rejoined the cluster after 15-20 minutes.
>>>
>>> The only workaround I have found so far is to restart the OSDs of the
>>> cluster, but this is quite a heavy operation. Could you help me
>>> understand if the behaviour described above is an expected one and
>>> what could be the reason for this? Does ceph clean up osd process
>>> threads appropriately?
>>>
>>> Extra info: all threads are in sleeping state right now and context
>>> switches have stabilized at the pre-crash levels
>>
>> Can you describe exactly what you observed with time intervals? E.g.:
>> did the OSDs get restarted after crashing, and how did the thread
>> counts relate to that? Did anything else happen in the cluster while
>> this was happening? How long did you wait before you began restarting
>> OSDs to reduce the thread counts?
>> -Greg
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
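
For anyone wanting to put numbers on the per-OSD thread counts discussed in this thread, here is a minimal sketch, not something posted by the participants. It assumes a Linux host where each OSD daemon shows up in /proc with the process name "ceph-osd", and it reads the "Threads:" field of /proc/<pid>/status. The function names are made up for illustration.

```python
import os


def parse_thread_count(status_text):
    """Return the integer in the 'Threads:' field of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def parse_process_name(status_text):
    """Return the value of the 'Name:' field, or an empty string if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return ""


def osd_thread_counts(proc_root="/proc", name="ceph-osd"):
    """Map pid -> thread count for every process whose name matches `name`.

    Assumption: the OSD daemons run under the process name "ceph-osd".
    """
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-pid entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # the process exited or is not readable
        if parse_process_name(text) == name:
            counts[int(entry)] = parse_thread_count(text)
    return counts


if __name__ == "__main__":
    per_osd = osd_thread_counts()
    for pid, threads in sorted(per_osd.items()):
        print(f"pid {pid}: {threads} threads")
    print(f"node total: {sum(per_osd.values())} threads")
```

Running this on each OSD node before and after an event like the one described would show whether the extra threads are spread evenly across the OSDs on a node.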