On May 21, 2014, at 1:33 AM, Gregory Farnum <greg at inktank.com> wrote:

> This failure means the messenger subsystem is trying to create a
> thread and is getting an error code back -- probably due to a process
> or system thread limit that you can turn up with ulimit.
>
> This is happening because a replicated PG primary needs connections
> only to its replicas (generally 1 or 2 connections), but with an
> erasure-coded PG the primary requires connections to m+n-1 replicas
> (everybody who's in the erasure-coding set, including itself). Right
> now our messenger requires a thread for each connection, so kerblam.
> (And it actually requires a couple such connections because we have
> separate heartbeat, cluster data, and client data systems.)

Hi Greg,
Is there any plan to refactor the messenger component to reduce the
number of threads? For example, using an event-driven model.

Thanks,
Guang
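To put rough numbers on Greg's arithmetic, here is a back-of-envelope
sketch. Only the 400-OSD count and the pool sizes come from Kenneth's
report below; the k=8/m=2 erasure profile and the
one-thread-per-peer-per-channel cost are illustrative assumptions, not
values from his cluster.

    // Worst-case estimate of connection threads on one OSD, comparing
    // an erasure-coded pool with a replicated one. Constants marked
    // as assumed are not from the original report.
    #include <cstdio>

    int main() {
        const int osds     = 400;  // from Kenneth's setup
        const int channels = 3;    // heartbeat, cluster data, client data

        // EC pool: 4096 PGs, assumed k=8, m=2 (set size 10).
        const int ec_pgs_per_osd = 4096 * 10 / osds;              // ~102
        const int ec_threads     = ec_pgs_per_osd * (10 - 1) * channels;

        // Replicated pool: 8192 PGs, 3 replicas (worked for Kenneth).
        const int rep_pgs_per_osd = 8192 * 3 / osds;              // ~61
        const int rep_threads     = rep_pgs_per_osd * (3 - 1) * channels;

        // This treats every PG peer as its own connection and thread,
        // which overcounts (connections to the same OSD are shared),
        // but the per-PG fan-out of set-1 peers is the point.
        printf("EC  worst case: ~%d threads\n", ec_threads);   // ~2754
        printf("Rep worst case: ~%d threads\n", rep_threads);  // ~366
        return 0;
    }

Even as an overcount, the ratio shows why the EC pool can hit a default
per-user thread limit that a replicated pool of the same size never
approaches.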
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>> Hi,
>>
>> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first
>> tried to create an erasure-coded pool with 4096 PGs, but this crashed
>> the cluster. I then started with 1024 PGs and expanded to 2048 (pg_num
>> and pgp_num); when I then try to expand to 4096 (still not quite
>> enough), the cluster crashes again. (Do we need fewer PGs with erasure
>> coding?)
>>
>> The crash starts with individual OSDs crashing, eventually bringing
>> down the mons (until there is no more quorum or too few OSDs).
>>
>> Out of the logs:
>>
>>    -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request: pg_query(0.974 epoch 3315) v3
>>    -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
>>    -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request: pg_query(0.974 epoch 3315) v3
>>    -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 -- 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>>    -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 -- 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>>    -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 ==== 47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>>    -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 ==== 47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>>     -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 -- 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 ==== pg_query(7.3ffs6 epoch 3326) v3 ==== 144+0+0 (221596234 0 0) 0x10b994a0 con 0x9383860
>>     -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631363, event: started, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631402, event: done, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>>      0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 10:31:55.630937
>> common/Thread.cc: 110: FAILED assert(ret == 0)
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
>>  3: (Accepter::entry()+0x265) [0xb3ca45]
>>  4: (()+0x79d1) [0x7fd4436b19d1]
>>  5: (clone()+0x6d) [0x7fd4423ecb6d]
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal (Aborted) **
>>  in thread 7fd3bc5fe700
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-osd() [0x9ab3b1]
>>  2: (()+0xf710) [0x7fd4436b9710]
>>  3: (gsignal()+0x35) [0x7fd442336925]
>>  4: (abort()+0x175) [0x7fd442338105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
>>  6: (()+0xbcbe6) [0x7fd442beebe6]
>>  7: (()+0xbcc13) [0x7fd442beec13]
>>  8: (()+0xbcd0e) [0x7fd442beed0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0xaec612]
>>  10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  11: (Pipe::connect()+0x2efb) [0xb2850b]
>>  12: (Pipe::writer()+0x9f3) [0xb2a063]
>>  13: (Pipe::Writer::entry()+0xd) [0xb359cd]
>>  14: (()+0x79d1) [0x7fd4436b19d1]
>>  15: (clone()+0x6d) [0x7fd4423ecb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
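The two OSD backtraces pinpoint it: one thread-creation failure comes
from the Accepter taking a new incoming connection, the other from
Pipe::connect() opening an outgoing one, and both die in
Thread::create(), which asserts that pthread_create() returned 0. A
minimal standalone sketch (not Ceph source) of what that failing path
boils down to:

    // When a process hits its thread limit, pthread_create() returns
    // EAGAIN instead of 0; Ceph's Thread::create() asserts on a
    // non-zero return, aborting with the SIGABRT backtrace above.
    // Try it in a scratch shell with a lowered limit: `ulimit -u 200`.
    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <pthread.h>
    #include <unistd.h>

    static void *idle(void *) {
        pause();             // park the thread so it keeps its slot
        return nullptr;
    }

    int main() {
        for (long n = 0;; ++n) {
            pthread_t t;
            int ret = pthread_create(&t, nullptr, idle, nullptr);
            if (ret != 0) {  // Ceph instead does: assert(ret == 0)
                printf("pthread_create failed after %ld threads: %s\n",
                       n, ret == EAGAIN ? "EAGAIN (thread limit)"
                                        : strerror(ret));
                return 1;
            }
        }
    }

With a deliberately low limit this fails within a second with EAGAIN,
most likely the same error the OSDs hit once the node-wide population
of messenger threads exhausts the per-user limit.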
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:37:50.378377 7ff018059700 -1 *** Caught signal (Aborted) **
>>  in thread 7ff018059700
>>
>> In the mon:
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-mon() [0x86b991]
>>  2: (()+0xf710) [0x7ff01ee5b710]
>>  3: (gsignal()+0x35) [0x7ff01dad8925]
>>  4: (abort()+0x175) [0x7ff01dada105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ff01e392a5d]
>>  6: (()+0xbcbe6) [0x7ff01e390be6]
>>  7: (()+0xbcc13) [0x7ff01e390c13]
>>  8: (()+0xbcd0e) [0x7ff01e390d0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x7a5472]
>>  10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
>>  11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
>>  12: (Accepter::entry()+0x265) [0x863295]
>>  13: (()+0x79d1) [0x7ff01ee539d1]
>>  14: (clone()+0x6d) [0x7ff01db8eb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> When I make a replicated pool, I can already go to 8192 PGs without
>> problems.
>>
>> Thanks already!!
>>
>> Kind regards,
>> Kenneth
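On Greg's "turn it up with ulimit" point: the relevant knob is
RLIMIT_NPROC, the per-user process/thread limit that `ulimit -u`
adjusts. As an illustration (not how ceph-osd actually configures
itself), a daemon can inspect the limit and raise its soft value up to
the hard ceiling at startup:

    // Sketch of checking and raising the per-user process/thread limit
    // from inside a daemon, mirroring what `ulimit -u <n>` does in the
    // launching shell. Illustrative only; not Ceph source.
    #include <cstdio>
    #include <sys/resource.h>

    int main() {
        rlimit rl{};
        if (getrlimit(RLIMIT_NPROC, &rl) != 0) { perror("getrlimit"); return 1; }
        printf("NPROC soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;  // soft limit can be raised this far unprivileged
        if (setrlimit(RLIMIT_NPROC, &rl) != 0) { perror("setrlimit"); return 1; }

        // Raising the hard limit itself needs root/CAP_SYS_RESOURCE,
        // e.g. an entry for the ceph user in /etc/security/limits.conf.
        // System-wide ceilings such as /proc/sys/kernel/threads-max can
        // also cap thread creation regardless of the per-user rlimit.
        return 0;
    }

Raising limits only buys headroom, though; Guang's question about an
event-driven messenger goes at the underlying cost of at least one
thread per open connection.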