On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
<Kenneth.Waegeman at ugent.be> wrote:
>
> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>    Date: Wed, 21 May 2014 15:46:17 -0700
>    From: Gregory Farnum <greg at inktank.com>
> Subject: Re: Expanding pg's of an erasure coded pool
>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>      Cc: ceph-users <ceph-users at lists.ceph.com>
>
>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>> <Kenneth.Waegeman at ugent.be> wrote:
>>>
>>> Thanks! I increased the max processes parameter for all daemons quite
>>> a lot (up to ulimit -u 3802720).
>>>
>>> These are the limits for the daemons now:
>>>
>>> [root@ ~]# cat /proc/17006/limits
>>> Limit                     Soft Limit     Hard Limit     Units
>>> Max cpu time              unlimited      unlimited      seconds
>>> Max file size             unlimited      unlimited      bytes
>>> Max data size             unlimited      unlimited      bytes
>>> Max stack size            10485760       unlimited      bytes
>>> Max core file size        unlimited      unlimited      bytes
>>> Max resident set          unlimited      unlimited      bytes
>>> Max processes             3802720        3802720        processes
>>> Max open files            32768          32768          files
>>> Max locked memory         65536          65536          bytes
>>> Max address space         unlimited      unlimited      bytes
>>> Max file locks            unlimited      unlimited      locks
>>> Max pending signals       95068          95068          signals
>>> Max msgqueue size         819200         819200         bytes
>>> Max nice priority         0              0
>>> Max realtime priority     0              0
>>> Max realtime timeout      unlimited      unlimited      us
>>>
>>> But this didn't help. Are there other parameters I should change?
>>
>> Hrm, is it exactly the same stack trace? You might need to bump the
>> open files limit as well, although I'd be surprised. :/
>
> I increased the open file limit to 128000 as a test; still the same
> results.
>
> Stack trace: <snip>
>
> But I also see some things happening on the system while doing this:
>
> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
> set pool 16 pgp_num to 4096
> [root@ ~]# ceph status
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 830, in <module>
>     sys.exit(main())
>   File "/usr/bin/ceph", line 590, in main
>     conffile=conffile)
>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
>     librados_path = find_library('rados')
>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
>     return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in _findSoname_ldconfig
>     os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
> OSError: [Errno 12] Cannot allocate memory
> [root@ ~]# lsof | wc
> -bash: fork: Cannot allocate memory
> [root@ ~]# lsof | wc
>   21801  211209 3230028
> [root@ ~]# ceph status
> ^CError connecting to cluster: InterruptedOrTimeoutError
> [root@ ~]# lsof | wc
>    2028   17476  190947
>
> And meanwhile the daemons have crashed.
>
> I verified the memory never ran out.

Is there anything in dmesg? It sure looks like the OS thinks it's run
out of memory one way or another.
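
For example (a rough sketch; the exact messages vary by kernel version),
something like this should turn up OOM-killer activity or failed
allocations around the time of the crash:

    dmesg | grep -iE 'out of memory|oom|allocation failure'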
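
One pattern that would fit "fork: Cannot allocate memory" while free RAM
still looks fine (a guess, not something the output above confirms): with
vm.overcommit_memory=2 the kernel stops granting new address space once
Committed_AS reaches CommitLimit, and fork() has to commit a full
copy-on-write copy of the parent's address space, so it can fail long
before physical memory is exhausted. Worth comparing:

    # How strict is overcommit on this box?
    sysctl vm.overcommit_memory vm.overcommit_ratio
    # Committed_AS close to CommitLimit would explain the ENOMEMs.
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com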