Re: osds crashing on Thread::create

i just checked several of the osds running in the environment and the hard and soft limits for the number of processes are set to 257486. if it's exceeding that, then it seems like there would still be a bug somewhere. i can't imagine an osd needing that many.

$ for N in `pidof ceph-osd`; do echo ${N}; sudo grep processes /proc/${N}/limits; done
8761
Max processes             257486               257486               processes 
7744
Max processes             257486               257486               processes 
5536
Max processes             257486               257486               processes 
4717
Max processes             257486               257486               processes 

i did go looking through the ceph init script and didn't find where that limit was getting set, and i couldn't find any reference to setrlimit in the code either. so i'm not sure how it gets set.
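
in case it helps anyone else, this is roughly where i looked (the exact paths are guesses and will vary by distro and init system):

$ grep -r nproc /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null
$ grep -ri ulimit /etc/init.d/ceph /etc/init/ceph-osd* 2>/dev/null
$ sudo grep processes /proc/1/limits

my guess is that 257486 is just the kernel default, which i believe gets derived from kernel.threads-max at boot, so it may never be set explicitly anywhere.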

this did lead me into looking at how many threads were getting created per process and how many there were in total on the system. it looks like there are just over 30k total tasks (pids and threads) on the system. i just set kernel.pid_max to 64k and will keep an eye on it. it would make sense that this is the problem, since kernel.pid_max defaults to 32768. i'm a little surprised to see it get this close with only 12 osds running. it looks like they're creating over 2500 threads each. i don't know the internals of the code, but that seems like a lot. oh well. hopefully this fixes it.
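
in case it's useful, these are roughly the commands i used to count the threads and bump pid_max (the 65536 value is just what i picked, not an official recommendation):

$ for N in `pidof ceph-osd`; do echo -n "${N} "; sudo ls /proc/${N}/task | wc -l; done
$ ps -eLf | wc -l
$ sudo sysctl -w kernel.pid_max=65536
$ echo 'kernel.pid_max = 65536' | sudo tee -a /etc/sysctl.conf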

mike

On Mon, Mar 7, 2016 at 1:55 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Mon, Mar 7, 2016 at 11:04 AM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
> first off, hello all. this is my first time posting to the list.
>
> i have seen a recurring problem that started in the past week or so on
> one of my ceph clusters. osds will crash and it seems to happen whenever
> backfill or recovery is started. looking at the logs it appears that the
> osd is asserting in src/common/Thread.cc when it tries to create a new
> thread. these osds are running 0.94.5 and i believe
> https://github.com/ceph/ceph/blob/v0.94.5/src/common/Thread.cc#L129 is the
> assert that is being hit. i looked back through the code for a couple
> minutes and it looks like it's asserting on pthread_create returning
> something besides 0. i'm not sure why pthread_create would be failing and it
> looks like it just writes what the return code is to stderr. i also wasn't
> able to determine where the output of stderr ended up from my osds. from
> looking at /proc/<pid>/fd/{0,1,2} and lsof, it looks like stderr is a unix
> socket, but i don't see where it goes after that. the osds are started by
> ceph-disk activate.
>
> do any of you have any ideas as to what might be causing this? or how i
> might further troubleshoot this? i'm attaching a trimmed version of the osd
> log. i removed some extraneous bits from after the osd was restarted and a
> large amount of 'recent events' that were from well before the crash.

Usually you just need to increase the ulimits for thread/process
counts, on the ceph user account or on the system as a whole. Check
the docs and the startup scripts.
-Greg
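
For example, something along these lines, though the exact file names and values here are only a sketch and will depend on your distro and on which user the daemons run as:

$ echo 'ceph soft nproc 65536' | sudo tee -a /etc/security/limits.d/90-ceph.conf
$ echo 'ceph hard nproc 65536' | sudo tee -a /etc/security/limits.d/90-ceph.conf
$ sudo sysctl -w kernel.pid_max=65536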

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
