On Thu, May 22, 2014 at 4:09 AM, Kenneth Waegeman
<Kenneth.Waegeman at ugent.be> wrote:
>
> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>    Date: Wed, 21 May 2014 15:46:17 -0700
>    From: Gregory Farnum <greg at inktank.com>
> Subject: Re: Expanding pg's of an erasure coded pool
>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>      Cc: ceph-users <ceph-users at lists.ceph.com>
>
>> On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
>> <Kenneth.Waegeman at ugent.be> wrote:
>>>
>>> Thanks! I increased the max processes parameter for all daemons quite
>>> a lot (up to ulimit -u 3802720).
>>>
>>> These are the limits for the daemons now:
>>>
>>> [root@ ~]# cat /proc/17006/limits
>>> Limit                     Soft Limit     Hard Limit     Units
>>> Max cpu time              unlimited      unlimited      seconds
>>> Max file size             unlimited      unlimited      bytes
>>> Max data size             unlimited      unlimited      bytes
>>> Max stack size            10485760       unlimited      bytes
>>> Max core file size        unlimited      unlimited      bytes
>>> Max resident set          unlimited      unlimited      bytes
>>> Max processes             3802720        3802720        processes
>>> Max open files            32768          32768          files
>>> Max locked memory         65536          65536          bytes
>>> Max address space         unlimited      unlimited      bytes
>>> Max file locks            unlimited      unlimited      locks
>>> Max pending signals       95068          95068          signals
>>> Max msgqueue size         819200         819200         bytes
>>> Max nice priority         0              0
>>> Max realtime priority     0              0
>>> Max realtime timeout      unlimited      unlimited      us
>>>
>>> But this didn't help. Are there other parameters I should change?
>>
>> Hrm, is it exactly the same stack trace? You might need to bump the
>> open files limit as well, although I'd be surprised. :/
>
> I increased the open file limit to 128000 as a test; still the same
> results.
>
> Stack trace: <snip>
>
> But I also see some things happening on the system while doing this:
>
> [root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
> set pool 16 pgp_num to 4096
> [root@ ~]# ceph status
> Traceback (most recent call last):
>   File "/usr/bin/ceph", line 830, in <module>
>     sys.exit(main())
>   File "/usr/bin/ceph", line 590, in main
>     conffile=conffile)
>   File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
>     librados_path = find_library('rados')
>   File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
>     return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
>   File "/usr/lib64/python2.6/ctypes/util.py", line 203, in _findSoname_ldconfig
>     os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
> OSError: [Errno 12] Cannot allocate memory
> [root@ ~]# lsof | wc
> -bash: fork: Cannot allocate memory
> [root@ ~]# lsof | wc
>   21801  211209 3230028
> [root@ ~]# ceph status
> ^CError connecting to cluster: InterruptedOrTimeoutError
> [root@ ~]# lsof | wc
>    2028   17476  190947
>
> And meanwhile the daemons have crashed.
>
> I verified the memory never ran out.

Is there anything in dmesg? It sure looks like the OS thinks it's run
out of memory one way or another.
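
For example (a rough sketch; the exact messages vary by kernel version),
something like this should turn up OOM-killer activity or failed
allocations around the time of the crash:

    dmesg | grep -iE 'out of memory|oom|allocation failure'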
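
One pattern that would fit "fork: Cannot allocate memory" while free RAM
still looks fine (a guess, not something the output above confirms): with
vm.overcommit_memory=2 the kernel stops granting new address space once
Committed_AS reaches CommitLimit, and fork() has to commit a full
copy-on-write copy of the parent's address space, so it can fail long
before physical memory is exhausted. Worth comparing:

    # How strict is overcommit on this box?
    sysctl vm.overcommit_memory vm.overcommit_ratio
    # Committed_AS close to CommitLimit would explain the ENOMEMs.
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com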