Hi Christian,

Your problem is probably that your kernel.pid_max (the maximum number of
threads + processes across the entire system) needs to be increased - the
default is 32768, which is too low for even a medium-density deployment.

You can test this easily enough with

$ ps axms | wc -l

If you get a number around the 30,000 mark then you are going to be
affected. There's an issue for this (http://tracker.ceph.com/issues/6142),
although it doesn't seem to have gotten much traction in terms of
informing users.
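If that is the culprit, raising the limit is straightforward. For example
(the value below is just an illustration - pick anything comfortably above
the thread count you expect; the /etc/sysctl.d path assumes a distro whose
init reads that directory):
---
# take effect immediately on the running system
sysctl -w kernel.pid_max=4194303

# persist the setting across reboots (adjust path for your distro)
echo "kernel.pid_max = 4194303" > /etc/sysctl.d/80-pid-max.conf
---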
Regards,
Nathan

On 15/09/2014 7:13 PM, Christian Eichelmann wrote:
> Hi all,
>
> I have no idea why running out of file handles should produce an "out
> of memory" error, but well. I've increased the ulimit as you told me,
> and nothing changed. I've noticed that the osd init script sets the max
> open file handles explicitly, so I set the corresponding option in my
> ceph.conf. Now the limits of an OSD process look like this:
>
> Limit                     Soft Limit     Hard Limit     Units
> Max cpu time              unlimited      unlimited      seconds
> Max file size             unlimited      unlimited      bytes
> Max data size             unlimited      unlimited      bytes
> Max stack size            8388608        unlimited      bytes
> Max core file size        unlimited      unlimited      bytes
> Max resident set          unlimited      unlimited      bytes
> Max processes             2067478        2067478        processes
> Max open files            65536          65536          files
> Max locked memory         65536          65536          bytes
> Max address space         unlimited      unlimited      bytes
> Max file locks            unlimited      unlimited      locks
> Max pending signals       2067478        2067478        signals
> Max msgqueue size         819200         819200         bytes
> Max nice priority         0              0
> Max realtime priority     0              0
> Max realtime timeout      unlimited      unlimited      us
>
> Anyway, the exact same behavior as before. I also found a mail on this
> list from someone who had the exact same problem:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040059.html
>
> Unfortunately, there was no real solution for the problem there either.
>
> So again: this is *NOT* a ulimit issue. We were running emperor and
> dumpling on the same hardware without any issues. The problems first
> started after our upgrade to firefly.
>
> Regards,
> Christian
>
>
> On 12.09.2014 18:26, Christian Balzer wrote:
>> On Fri, 12 Sep 2014 12:05:06 -0400 Brian Rak wrote:
>>
>>> That's not how ulimit works. Check the `ulimit -a` output.
>>>
>> Indeed.
>>
>> And to forestall the next questions, see "man initscript"; mine looks
>> like this:
>> ---
>> ulimit -Hn 131072
>> ulimit -Sn 65536
>>
>> # Execute the program.
>> eval exec "$4"
>> ---
>>
>> And also an /etc/security/limits.d/tuning.conf (Debian) like this:
>> ---
>> root soft nofile 65536
>> root hard nofile 131072
>> * soft nofile 16384
>> * hard nofile 65536
>> ---
>>
>> Adjust these to your actual needs. There might be other limits you're
>> hitting, but that is the most likely one.
>>
>> Also, 45 OSDs with 12 (24 with HT, bleah) CPU cores is pretty ballsy.
>> I personally would rather do 4 RAID6s (10 disks each, with SSD OSD
>> journals) with that kind of case and enjoy the fact that my OSDs never
>> fail. ^o^
>>
>> Christian (another one)
>>
>>
>>> On 9/12/2014 10:15 AM, Christian Eichelmann wrote:
>>>> Hi,
>>>>
>>>> I am running all commands as root, so there are no limits for the
>>>> processes.
>>>>
>>>> Regards,
>>>> Christian
>>>> _______________________________________
>>>> From: Mariusz Gronczewski [mariusz.gronczewski at efigence.com]
>>>> Sent: Friday, 12 September 2014 15:33
>>>> To: Christian Eichelmann
>>>> Cc: ceph-users at lists.ceph.com
>>>> Subject: Re: OSDs are crashing with "Cannot fork" or
>>>> "cannot create thread" but plenty of memory is left
>>>>
>>>> do cat /proc/<pid>/limits
>>>>
>>>> probably you hit the max processes limit or the max FD limit
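>>>> e.g. something along these lines (just a sketch - it assumes the
>>>> daemons show up as "ceph-osd" in the process table):
>>>> ---
>>>> # print the interesting limits plus the live thread count per OSD
>>>> for pid in $(pidof ceph-osd); do
>>>>     echo "== ceph-osd pid $pid =="
>>>>     grep -E 'Max (open files|processes)' /proc/$pid/limits
>>>>     echo "threads: $(ls /proc/$pid/task | wc -l)"
>>>> done
>>>> ---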
>>>>
>>>>> Hi Ceph-Users,
>>>>>
>>>>> I have absolutely no idea what is going on on my systems...
>>>>>
>>>>> Hardware:
>>>>> 45 x 4TB hard disks
>>>>> 2 x 6-core CPUs
>>>>> 256GB memory
>>>>>
>>>>> When initializing all disks and joining them to the cluster, after
>>>>> approximately 30 OSDs, other OSDs start crashing. When I try to
>>>>> start them again I see different kinds of errors. For example:
>>>>>
>>>>> Starting Ceph osd.316 on ceph-osd-bs04...already running
>>>>> === osd.317 ===
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/bin/ceph", line 830, in <module>
>>>>>     sys.exit(main())
>>>>>   File "/usr/bin/ceph", line 773, in main
>>>>>     sigdict, inbuf, verbose)
>>>>>   File "/usr/bin/ceph", line 420, in new_style_command
>>>>>     inbuf=inbuf)
>>>>>   File "/usr/lib/python2.7/dist-packages/ceph_argparse.py", line 1112,
>>>>> in json_command
>>>>>     raise RuntimeError('"{0}": exception {1}'.format(cmd, e))
>>>>> NameError: global name 'cmd' is not defined
>>>>> Exception thread.error: error("can't start new thread",) in <bound
>>>>> method Rados.__del__ of <rados.Rados object at 0x29ee410>> ignored
>>>>>
>>>>> or:
>>>>>
>>>>> /etc/init.d/ceph: 190: /etc/init.d/ceph: Cannot fork
>>>>> /etc/init.d/ceph: 191: /etc/init.d/ceph: Cannot fork
>>>>> /etc/init.d/ceph: 192: /etc/init.d/ceph: Cannot fork
>>>>>
>>>>> or:
>>>>>
>>>>> /usr/bin/ceph-crush-location: 72: /usr/bin/ceph-crush-location: Cannot fork
>>>>> /usr/bin/ceph-crush-location: 79: /usr/bin/ceph-crush-location: Cannot fork
>>>>> Thread::try_create(): pthread_create failed with error 11
>>>>> common/Thread.cc: In function 'void Thread::create(size_t)' thread
>>>>> 7fcf768c9760 time 2014-09-12 15:00:28.284735
>>>>> common/Thread.cc: 110: FAILED assert(ret == 0)
>>>>>  ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>>>>>  1: /usr/bin/ceph-conf() [0x51de8f]
>>>>>  2: (CephContext::CephContext(unsigned int)+0xb1) [0x520fe1]
>>>>>  3: (common_preinit(CephInitParameters const&, code_environment_t,
>>>>> int)+0x48) [0x52eb78]
>>>>>  4: (global_pre_init(std::vector<char const*, std::allocator<char
>>>>> const*> >*, std::vector<char const*, std::allocator<char const*> >&,
>>>>> unsigned int, code_environment_t, int)+0x8d) [0x518d0d]
>>>>>  5: (main()+0x17a) [0x514f6a]
>>>>>  6: (__libc_start_main()+0xfd) [0x7fcf7522ceed]
>>>>>  7: /usr/bin/ceph-conf() [0x5168d1]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>>>> Aborted (core dumped)
>>>>> /etc/init.d/ceph: 340: /etc/init.d/ceph: Cannot fork
>>>>> /etc/init.d/ceph: 1: /etc/init.d/ceph: Cannot fork
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/bin/ceph", line 830, in <module>
>>>>>     sys.exit(main())
>>>>>   File "/usr/bin/ceph", line 590, in main
>>>>>     conffile=conffile)
>>>>>   File "/usr/lib/python2.7/dist-packages/rados.py", line 198, in __init__
>>>>>     librados_path = find_library('rados')
>>>>>   File "/usr/lib/python2.7/ctypes/util.py", line 224, in find_library
>>>>>     return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>>>>>   File "/usr/lib/python2.7/ctypes/util.py", line 213, in _findSoname_ldconfig
>>>>>     f = os.popen('/sbin/ldconfig -p 2>/dev/null')
>>>>> OSError: [Errno 12] Cannot allocate memory
>>>>>
>>>>> But anyway, when I look at the memory consumption of the system:
>>>>>
>>>>> # free -m
>>>>>              total       used       free     shared    buffers     cached
>>>>> Mem:        258450      25841     232609          0         18      15506
>>>>> -/+ buffers/cache:      10315     248135
>>>>> Swap:         3811          0       3811
>>>>>
>>>>> there are more than 230GB of memory available! What is going on there?
>>>>>
>>>>> System:
>>>>> Linux ceph-osd-bs04 3.14-0.bpo.1-amd64 #1 SMP Debian 3.14.12-1~bpo70+1
>>>>> (2014-07-13) x86_64 GNU/Linux
>>>>>
>>>>> Since this is happening on other hardware as well, I don't think it
>>>>> is hardware-related. I have no idea whether this is an OS issue
>>>>> (which would be seriously strange) or a Ceph issue.
>>>>>
>>>>> Since this only started happening AFTER we upgraded to firefly, I
>>>>> guess it has something to do with Ceph.
>>>>>
>>>>> ANY idea on what is going on here would be much appreciated!
>>>>>
>>>>> Regards,
>>>>> Christian
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>> --
>>>> Mariusz Gronczewski, Administrator
>>>>
>>>> Efigence S. A.
>>>> ul. Wołoska 9a, 02-583 Warszawa
>>>> T: [+48] 22 380 13 13
>>>> F: [+48] 22 380 13 14
>>>> E: mariusz.gronczewski at efigence.com
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com