Expanding pg's of an erasure coded pool

Thanks! I increased the max processes limit for all daemons quite a lot
(up to ulimit -u 3802720).

These are the limits for the daemons now:
[root@ ~]# cat /proc/17006/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3802720              3802720              processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       95068                95068                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

But this didn't help. Are there other parameters I should change?
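
The candidates I can think of are sysctl-level limits that don't show up
in the /proc/<pid>/limits output above: the kernel-wide pid/thread caps,
and the per-process map count (each thread stack needs its own mapping).
The values below are only illustrative; I haven't verified that any of
these is the actual culprit:

# current values
sysctl kernel.pid_max kernel.threads-max
sysctl vm.max_map_count

# raising them (illustrative values, not a tested recommendation)
sysctl -w kernel.pid_max=4194303
sysctl -w kernel.threads-max=999999
sysctl -w vm.max_map_count=262144

In particular kernel.pid_max defaults to 32768, and since every thread
gets its own id from that pool, 20 OSDs with a few thousand threads each
could hit it even though the per-process ulimit looks huge.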

I also got a 'bash: fork: Cannot allocate memory' error once when running
a command after starting the Ceph services. It shouldn't be an actual
memory shortage, because while monitoring during the failure there was
still enough memory available (much of it cached).
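
One way to check whether it is really thread exhaustion rather than RAM
is to watch the thread counts while the OSDs are peering (17006 is just
the example PID from the limits output above):

# total number of threads on the node
ps -eL --no-headers | wc -l

# threads of a single OSD daemon
grep Threads /proc/17006/status
ls /proc/17006/task | wc -l

# the kernel-wide cap those counts run into
cat /proc/sys/kernel/pid_max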


----- Message from Gregory Farnum <greg at inktank.com> ---------
    Date: Tue, 20 May 2014 10:33:30 -0700
    From: Gregory Farnum <greg at inktank.com>
Subject: Re: Expanding pg's of an erasure coded pool
      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
      Cc: ceph-users <ceph-users at lists.ceph.com>


> This failure means the messenger subsystem is trying to create a
> thread and is getting an error code back, probably due to a process
> or system thread limit that you can turn up with ulimit.
>
> This is happening because a replicated PG primary needs a connection
> to only its replicas (generally 1 or 2 connections), but with an
> erasure-coded PG the primary requires a connection to m+n-1 replicas
> (everybody who's in the erasure-coding set, including itself). Right
> now our messenger requires a thread for each connection, so kerblam.
> (And it actually requires a couple such connections because we have
> separate heartbeat, cluster data, and client data systems.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>> Hi,
>>
>> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
>> create an erasure coded pool with 4096 PGs, but this crashed the cluster.
>> I then started with 1024 PGs and expanded to 2048 (pg_num and pgp_num); when I
>> then try to expand to 4096 (not even quite enough) the cluster crashes
>> again. (Do we need fewer PGs with erasure coding?)
>>
>> The crash starts with individual OSDs crashing, eventually bringing down the
>> mons (until there is no more quorum or too few osds)
>>
>> Out of the logs:
>>
>>
>>    -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545590, event: started, request:
>> pg_query(0.974 epoch 3315) v3
>>    -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
>> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9
>> 026b40
>>    -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
>> 14301, time: 2014-05-20 10:31:55.545807, event: done, request:
>> pg_query(0.974 epoch 3315) v3
>>    -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
>> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
>> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>>    -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
>> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
>> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>>    -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 ====
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 ==== 47+0+0
>> (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>>    -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
>> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 ====
>> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 ==== 47+0+0
>> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>>     -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
>> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 ====
>> pg_query(7.3ffs6 epoch 3326) v3 ==== 144+0+0 (221596234 0 0) 0x10b994a0 con
>> 0x9383860
>>     -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631363, event: started, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq:
>> 14302, time: 2014-05-20 10:31:55.631402, event: done, request:
>> pg_query(7.3ffs6 epoch 3326) v3
>>     -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --
>> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --
>> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>>      0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In
>> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20
>> 10:31:55.630937
>> common/Thread.cc: 110: FAILED assert(ret == 0)
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
>>  3: (Accepter::entry()+0x265) [0xb3ca45]
>>  4: (()+0x79d1) [0x7fd4436b19d1]
>>  5: (clone()+0x6d) [0x7fd4423ecb6d]
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7fd3bc5fe700
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-osd() [0x9ab3b1]
>>  2: (()+0xf710) [0x7fd4436b9710]
>>  3: (gsignal()+0x35) [0x7fd442336925]
>>  4: (abort()+0x175) [0x7fd442338105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
>>  6: (()+0xbcbe6) [0x7fd442beebe6]
>>  7: (()+0xbcc13) [0x7fd442beec13]
>>  8: (()+0xbcd0e) [0x7fd442beed0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x7f2) [0xaec612]
>>  10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  11: (Pipe::connect()+0x2efb) [0xb2850b]
>>  12: (Pipe::writer()+0x9f3) [0xb2a063]
>>  13: (Pipe::Writer::entry()+0xd) [0xb359cd]
>>  14: (()+0x79d1) [0x7fd4436b19d1]
>>  15: (clone()+0x6d) [0x7fd4423ecb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:37:50.378377 7ff018059700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7ff018059700
>>
>> in the mon:
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-mon() [0x86b991]
>>  2: (()+0xf710) [0x7ff01ee5b710]
>>  3: (gsignal()+0x35) [0x7ff01dad8925]
>>  4: (abort()+0x175) [0x7ff01dada105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ff01e392a5d]
>>  6: (()+0xbcbe6) [0x7ff01e390be6]
>>  7: (()+0xbcc13) [0x7ff01e390c13]
>>  8: (()+0xbcd0e) [0x7ff01e390d0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x7f2) [0x7a5472]
>>  10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
>>  11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
>>  12: (Accepter::entry()+0x265) [0x863295]
>>  13: (()+0x79d1) [0x7ff01ee539d1]
>>  14: (clone()+0x6d) [0x7ff01db8eb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> When I make a replicated pool, I can already go to 8192 PGs without problems.
>>
>> Thanks already!!
>>
>> Kind regards,
>> Kenneth
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


----- End message from Gregory Farnum <greg at inktank.com> -----
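
A quick back-of-envelope based on Greg's explanation above, with my own
assumptions filled in (every OSD eventually holds a connection to nearly
every other OSD once the EC PGs spread out, and each connection needs at
least a reader and a writer thread, which is my assumption, not Greg's
exact wording):

# 400 OSDs, so up to ~399 peers per OSD; three separate systems
# (heartbeat, cluster data, client data); 2 threads per connection
echo $(( 399 * 3 * 2 ))        # roughly 2400 threads for one OSD
echo $(( 20 * 399 * 3 * 2 ))   # roughly 48000 threads for the 20 OSDs on
                               # one node, above the default kernel.pid_max
                               # of 32768

If that estimate is anywhere near right, it would explain why raising only
the per-process limits did not help.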

-- 

Kind regards,
Kenneth Waegeman



