On Mon, Mar 7, 2016 at 11:04 AM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
> first off, hello all. this is my first time posting to the list.
>
> i have seen a recurring problem that has started in the past week or so on
> one of my ceph clusters. osds will crash and it seems to happen whenever
> backfill or recovery is started. looking at the logs it appears that the
> osd is asserting in src/common/Thread.cc when it tries to create a new
> thread. these osds are running 0.94.5 and i believe
> https://github.com/ceph/ceph/blob/v0.94.5/src/common/Thread.cc#L129 is the
> assert that is being hit. i looked back through the code for a couple
> minutes and it looks like it's asserting on pthread_create returning
> something besides 0. i'm not sure why pthread_create would be failing and it
> looks like it just writes what the return code is to stderr. i also wasn't
> able to determine where the output of stderr ended up from my osds. it looks
> like from looking at /proc/<pid>/fd/{0,1,2} and lsof that stderr is a unix
> socket but i don't see where it goes after that. the osds are started by
> ceph-disk activate.
>
> do any of you have any ideas as to what might be causing this? or how i
> might further troubleshoot this? i'm attaching a trimmed version of the osd
> log. i removed some extraneous bits from after the osds were restarted and a
> large amount of 'recent events' that were from well before the crash.

Usually you just need to increase the ulimits for thread/process counts, on
the ceph user account or on the system as a whole. Check the docs and the
startup scripts.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
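
To check whether a thread/process limit is a plausible cause before changing anything, one option is to compare a running OSD's thread count with the limits that apply to it. The following is only a rough sketch, assuming a Linux host with procfs mounted; the function names are made up for illustration and are not part of ceph. It reads the "Max processes" row from /proc/<pid>/limits, counts the entries in /proc/<pid>/task, and prints the system-wide kernel.threads-max and kernel.pid_max values. pthread_create is documented to return EAGAIN when such a limit is hit, though the original post does not show which error code was actually returned.

```python
import os
import sys


def parse_max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row of /proc/<pid>/limits.

    None stands for 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft, hard = line[len("Max processes"):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max processes' row found")


def read_int(path):
    """Read a single integer from a procfs file, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def count_threads(pid):
    """Count threads of one process: one directory per thread in /proc/<pid>/task."""
    return len(os.listdir(f"/proc/{pid}/task"))


def report(pid):
    """Print the thread count of `pid` next to the limits that can cap it."""
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_processes(f.read())
    print(f"pid {pid}: {count_threads(pid)} threads")
    # The nproc limit counts every process and thread owned by the same user,
    # so the per-process count above is only a lower bound on what is charged.
    print(f"Max processes (soft/hard): {soft}/{hard}")
    print(f"kernel.threads-max: {read_int('/proc/sys/kernel/threads-max')}")
    print(f"kernel.pid_max: {read_int('/proc/sys/kernel/pid_max')}")


if __name__ == "__main__":
    if not os.path.isdir("/proc/self/task"):
        print("procfs not available; this sketch only works on Linux")
    else:
        arg = sys.argv[1] if len(sys.argv) > 1 else ""
        report(int(arg) if arg.isdigit() else os.getpid())
```

Run it with the pid of a ceph-osd process as the only argument (reading another user's /proc/<pid>/limits generally needs root or the same user). If the combined thread count of all OSDs running under that user is close to the soft limit, or the system-wide totals approach threads-max or pid_max, raising those limits is the likely fix.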