On May 21, 2014, at 1:33 AM, Gregory Farnum <greg at inktank.com> wrote:

> This failure means the messenger subsystem is trying to create a
> thread and is getting an error code back -- probably due to a process
> or system thread limit that you can turn up with ulimit.
>
> This is happening because a replicated PG primary needs connections
> only to its replicas (generally 1 or 2 connections), but with an
> erasure-coded PG the primary requires connections to m+n-1 replicas
> (everybody who's in the erasure-coding set, including itself). Right
> now our messenger requires a thread for each connection, so kerblam.
> (And it actually requires a couple such connections because we have
> separate heartbeat, cluster data, and client data systems.)

Hi Greg,
Is there any plan to refactor the messenger component to reduce the
number of threads? For example, using an event-driven model.

Thanks,
Guang
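To put rough numbers on Greg's arithmetic, here is a back-of-envelope
sketch. Only the 400-OSD count and the pool sizes come from Kenneth's
report below; the k=8/m=2 erasure profile and the
one-thread-per-peer-per-channel cost are illustrative assumptions, not
values from his cluster.

    // Worst-case estimate of connection threads on one OSD, comparing
    // an erasure-coded pool with a replicated one. Constants marked
    // as assumed are not from the original report.
    #include <cstdio>

    int main() {
        const int osds     = 400;  // from Kenneth's setup
        const int channels = 3;    // heartbeat, cluster data, client data

        // EC pool: 4096 PGs, assumed k=8, m=2 (set size 10).
        const int ec_pgs_per_osd = 4096 * 10 / osds;              // ~102
        const int ec_threads     = ec_pgs_per_osd * (10 - 1) * channels;

        // Replicated pool: 8192 PGs, 3 replicas (worked for Kenneth).
        const int rep_pgs_per_osd = 8192 * 3 / osds;              // ~61
        const int rep_threads     = rep_pgs_per_osd * (3 - 1) * channels;

        // This treats every PG peer as its own connection and thread,
        // which overcounts (connections to the same OSD are shared),
        // but the per-PG fan-out of set-1 peers is the point.
        printf("EC  worst case: ~%d threads\n", ec_threads);   // ~2754
        printf("Rep worst case: ~%d threads\n", rep_threads);  // ~366
        return 0;
    }

Even as an overcount, the ratio shows why the EC pool can hit a default
per-user thread limit that a replicated pool of the same size never
approaches.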
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>> Hi,
>>
>> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first
>> tried to create an erasure-coded pool with 4096 PGs, but this crashed
>> the cluster. I then started with 1024 PGs and expanded to 2048 (pg_num
>> and pgp_num); when I then try to expand to 4096 (still not quite
>> enough), the cluster crashes again. (Do we need fewer PGs with erasure
>> coding?)
>>
>> The crash starts with individual OSDs crashing, eventually bringing
>> down the mons (until there is no more quorum or too few OSDs).
>>
>> Out of the logs:
>>
>>    -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request: pg_query(0.974 epoch 3315) v3
>>    -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
>>    -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request: pg_query(0.974 epoch 3315) v3
>>    -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 -- 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>>    -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 -- 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>>    -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 ==== 47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>>    -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 ==== 47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>>     -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 -- 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 ==== pg_query(7.3ffs6 epoch 3326) v3 ==== 144+0+0 (221596234 0 0) 0x10b994a0 con 0x9383860
>>     -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631363, event: started, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631402, event: done, request: pg_query(7.3ffs6 epoch 3326) v3
>>     -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>>      0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 10:31:55.630937
>> common/Thread.cc: 110: FAILED assert(ret == 0)
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
>>  3: (Accepter::entry()+0x265) [0xb3ca45]
>>  4: (()+0x79d1) [0x7fd4436b19d1]
>>  5: (clone()+0x6d) [0x7fd4423ecb6d]
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal (Aborted) **
>>  in thread 7fd3bc5fe700
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-osd() [0x9ab3b1]
>>  2: (()+0xf710) [0x7fd4436b9710]
>>  3: (gsignal()+0x35) [0x7fd442336925]
>>  4: (abort()+0x175) [0x7fd442338105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
>>  6: (()+0xbcbe6) [0x7fd442beebe6]
>>  7: (()+0xbcc13) [0x7fd442beec13]
>>  8: (()+0xbcd0e) [0x7fd442beed0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0xaec612]
>>  10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>>  11: (Pipe::connect()+0x2efb) [0xb2850b]
>>  12: (Pipe::writer()+0x9f3) [0xb2a063]
>>  13: (Pipe::Writer::entry()+0xd) [0xb359cd]
>>  14: (()+0x79d1) [0x7fd4436b19d1]
>>  15: (clone()+0x6d) [0x7fd4423ecb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
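The two OSD backtraces pinpoint it: one thread-creation failure comes
from the Accepter taking a new incoming connection, the other from
Pipe::connect() opening an outgoing one, and both die in
Thread::create(), which asserts that pthread_create() returned 0. A
minimal standalone sketch (not Ceph source) of what that failing path
boils down to:

    // When a process hits its thread limit, pthread_create() returns
    // EAGAIN instead of 0; Ceph's Thread::create() asserts on a
    // non-zero return, aborting with the SIGABRT backtrace above.
    // Try it in a scratch shell with a lowered limit: `ulimit -u 200`.
    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <pthread.h>
    #include <unistd.h>

    static void *idle(void *) {
        pause();             // park the thread so it keeps its slot
        return nullptr;
    }

    int main() {
        for (long n = 0;; ++n) {
            pthread_t t;
            int ret = pthread_create(&t, nullptr, idle, nullptr);
            if (ret != 0) {  // Ceph instead does: assert(ret == 0)
                printf("pthread_create failed after %ld threads: %s\n",
                       n, ret == EAGAIN ? "EAGAIN (thread limit)"
                                        : strerror(ret));
                return 1;
            }
        }
    }

With a deliberately low limit this fails within a second with EAGAIN,
most likely the same error the OSDs hit once the node-wide population
of messenger threads exhausts the per-user limit.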
>>
>> --- begin dump of recent events ---
>>      0> 2014-05-20 10:37:50.378377 7ff018059700 -1 *** Caught signal (Aborted) **
>>  in thread 7ff018059700
>>
>> In the mon:
>>
>>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>>  1: /usr/bin/ceph-mon() [0x86b991]
>>  2: (()+0xf710) [0x7ff01ee5b710]
>>  3: (gsignal()+0x35) [0x7ff01dad8925]
>>  4: (abort()+0x175) [0x7ff01dada105]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ff01e392a5d]
>>  6: (()+0xbcbe6) [0x7ff01e390be6]
>>  7: (()+0xbcc13) [0x7ff01e390c13]
>>  8: (()+0xbcd0e) [0x7ff01e390d0e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x7a5472]
>>  10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
>>  11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
>>  12: (Accepter::entry()+0x265) [0x863295]
>>  13: (()+0x79d1) [0x7ff01ee539d1]
>>  14: (clone()+0x6d) [0x7ff01db8eb6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> When I make a replicated pool, I can already go to 8192 PGs without
>> problems.
>>
>> Thanks already!!
>>
>> Kind regards,
>> Kenneth
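On Greg's "turn it up with ulimit" point: the relevant knob is
RLIMIT_NPROC, the per-user process/thread limit that `ulimit -u`
adjusts. As an illustration (not how ceph-osd actually configures
itself), a daemon can inspect the limit and raise its soft value up to
the hard ceiling at startup:

    // Sketch of checking and raising the per-user process/thread limit
    // from inside a daemon, mirroring what `ulimit -u <n>` does in the
    // launching shell. Illustrative only; not Ceph source.
    #include <cstdio>
    #include <sys/resource.h>

    int main() {
        rlimit rl{};
        if (getrlimit(RLIMIT_NPROC, &rl) != 0) { perror("getrlimit"); return 1; }
        printf("NPROC soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;  // soft limit can be raised this far unprivileged
        if (setrlimit(RLIMIT_NPROC, &rl) != 0) { perror("setrlimit"); return 1; }

        // Raising the hard limit itself needs root/CAP_SYS_RESOURCE,
        // e.g. an entry for the ceph user in /etc/security/limits.conf.
        // System-wide ceilings such as /proc/sys/kernel/threads-max can
        // also cap thread creation regardless of the per-user rlimit.
        return 0;
    }

Raising limits only buys headroom, though; Guang's question about an
event-driven messenger goes at the underlying cost of at least one
thread per open connection.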