On 2014.05.22 19:55, Gregory Farnum wrote:
> On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>> Date: Wed, 21 May 2014 15:46:17 -0700
>> From: Gregory Farnum <greg at inktank.com>
>> Subject: Re: Expanding pg's of an erasure coded pool
>> To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>> Cc: ceph-users <ceph-users at lists.ceph.com>
>>
>>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>>> <Kenneth.Waegeman at ugent.be> wrote:
>>>> Thanks! I increased the max processes parameter for all daemons
>>>> quite a lot (until ulimit -u 3802720).
>>>>
>>>> These are the limits for the daemons now:
>>>> [root@ ~]# cat /proc/17006/limits
>>>> Limit                     Soft Limit           Hard Limit           Units
>>>> Max cpu time              unlimited            unlimited            seconds
>>>> Max file size             unlimited            unlimited            bytes
>>>> Max data size             unlimited            unlimited            bytes
>>>> Max stack size            10485760             unlimited            bytes
>>>> Max core file size        unlimited            unlimited            bytes
>>>> Max resident set          unlimited            unlimited            bytes
>>>> Max processes             3802720              3802720              processes
>>>> Max open files            32768                32768                files
>>>> Max locked memory         65536                65536                bytes
>>>> Max address space         unlimited            unlimited            bytes
>>>> Max file locks            unlimited            unlimited            locks
>>>> Max pending signals       95068                95068                signals
>>>> Max msgqueue size         819200               819200               bytes
>>>> Max nice priority         0                    0
>>>> Max realtime priority     0                    0
>>>> Max realtime timeout      unlimited            unlimited            us
>>>>
>>>> But this didn't help. Are there other parameters I should change?
>>>
>>> Hrm, is it exactly the same stack trace? You might need to bump the
>>> open files limit as well, although I'd be surprised. :/
>>
>> I increased the open file limit as a test to 128000; still the same results.
>>
>> Stack trace:
> <snip>
>
>> But I also see some things happening on the system while doing this:
>>
>> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
>> set pool 16 pgp_num to 4096
>> [root@ ~]# ceph status
>> Traceback (most recent call last):
>>   File "/usr/bin/ceph", line 830, in <module>
>>     sys.exit(main())
>>   File "/usr/bin/ceph", line 590, in main
>>     conffile=conffile)
>>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
>>     librados_path = find_library('rados')
>>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
>>     return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in _findSoname_ldconfig
>>     os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
>> OSError: [Errno 12] Cannot allocate memory
>> [root@ ~]# lsof | wc
>> -bash: fork: Cannot allocate memory
>> [root@ ~]# lsof | wc
>>   21801  211209 3230028
>> [root@ ~]# ceph status
>> ^CError connecting to cluster: InterruptedOrTimeoutError
>> [root@ ~]# lsof | wc
>>    2028   17476  190947
>>
>> And meanwhile the daemons had crashed.
>>
>> I verified that the memory never ran out.
>
> Is there anything in dmesg? It sure looks like the OS thinks it's run
> out of memory one way or another.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

Could it be related to memory fragmentation?
http://dom.as/2014/01/17/on-swapping-and-kernels/
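
In case it helps, here is a rough sketch of how one could check for low-order page exhaustion on the OSD hosts, using standard Linux /proc files and dmesg; the grep patterns below are only illustrative:

    # Free pages per allocation order and zone; near-zero counts in the
    # higher orders point to fragmentation even when MemFree looks fine.
    cat /proc/buddyinfo

    # Overall commit picture (these fields exist on 2.6.32-era kernels too).
    grep -E 'MemFree|Committed_AS|CommitLimit' /proc/meminfo

    # The kernel-side evidence Greg is asking about.
    dmesg | grep -iE 'page allocation failure|out of memory'

If buddyinfo shows plenty of order-0 pages but almost nothing above order 2 or 3 while the daemons are crashing, that would support the fragmentation theory from the article above.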
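
Separately, since the per-process ulimits quoted above already look generous, a sketch of the system-wide ceilings that also bound fork()/pthread_create(), compared against what the OSDs actually consume (sysctl names are the standard Linux ones; pid 17006 is just the daemon quoted above):

    # System-wide limits that can still make thread creation fail with ENOMEM.
    sysctl kernel.pid_max kernel.threads-max vm.max_map_count

    # Total threads across all ceph-osd processes on this host.
    ps -eLf | grep '[c]eph-osd' | wc -l

    # Number of memory mappings in one daemon (each thread stack adds some).
    wc -l /proc/17006/maps

Raising pgp_num triggers a lot of peering at once, so it may be worth watching whether these counts jump toward the limits right after the change.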