Expanding PGs of an erasure-coded pool

This failure means the messenger subsystem is trying to create a
thread and is getting an error code back, probably due to a process
or system thread limit that you can turn up with ulimit.
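
If you want to see where you stand, the usual knobs are the per-user
process/thread limit and the kernel-wide caps (the value below is
illustrative, not a recommendation):

  ulimit -u                                # per-user process/thread limit
  ulimit -u 131072                         # raise it for this shell/init script
  cat /proc/sys/kernel/threads-max         # kernel-wide thread cap
  grep Threads /proc/$(pidof -s ceph-osd)/status   # threads in one OSD right now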

This is happening because a replicated PG primary needs a connection
only to its replicas (generally 1 or 2 connections), but an
erasure-coded PG's primary requires a connection to every other member
of the erasure-coding set: k+m-1 connections, since the set of k+m
shards includes the primary itself. Right now our messenger requires a
thread for each connection, so kerblam. (And it actually requires a
couple such connections per peer, because we have separate heartbeat,
cluster data, and client data systems.)
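
To put rough numbers on it (a sketch; the profile and PG counts here are
illustrative, not taken from the report below): with a replicated pool of
size 3, a primary talks to 2 peers per PG, but with a k=8/m=3
erasure-code profile it talks to k+m-1 = 10. Multiply that by the
separate heartbeat, cluster, and client messengers, and by however many
distinct OSDs its PGs map to, and a single OSD in a 400-OSD cluster can
easily need thousands of pipe threads.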
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman
<Kenneth.Waegeman at ugent.be> wrote:
> Hi,
>
> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to
> create an erasure-coded pool with 4096 PGs, but this crashed the cluster.
> I then started with 1024 PGs and expanded to 2048 (pg_num and pgp_num), but
> when I then try to expand to 4096 (still not quite enough) the cluster
> crashes again. (Do we need fewer PGs with erasure coding?)
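>
> For reference, the expansion was done with commands of this form (the
> pool name here is illustrative):
>
>   ceph osd pool set ecpool pg_num 4096
>   ceph osd pool set ecpool pgp_num 4096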
>
> The crash starts with individual OSDs crashing, eventually bringing down
> the mons (until there is no more quorum or too few OSDs).
>
> From the logs:
>
>
>    -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq:
> 14301, time: 2014-05-20 10:31:55.545590, event: started, request:
> pg_query(0.974 epoch 3315) v3
>    -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 --
> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974
> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
>    -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq:
> 14301, time: 2014-05-20 10:31:55.545807, event: done, request:
> pg_query(0.974 epoch 3315) v3
>    -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 --
> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0
> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
>    -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 --
> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0
> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
>    -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 --
> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 ====
> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 ==== 47+0+0
> (855262282 0 0) 0xb6863c0 con 0x1255b9c0
>    -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 --
> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 ====
> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 ==== 47+0+0
> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
>     -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 --
> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 ====
> pg_query(7.3ffs6 epoch 3326) v3 ==== 144+0+0 (221596234 0 0) 0x10b994a0 con
> 0x9383860
>     -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631363, event: started, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq:
> 14302, time: 2014-05-20 10:31:55.631402, event: done, request:
> pg_query(7.3ffs6 epoch 3326) v3
>     -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 --
> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 --
> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
>      0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20
> 10:31:55.630937
> common/Thread.cc: 110: FAILED assert(ret == 0)
>
>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
>  3: (Accepter::entry()+0x265) [0xb3ca45]
>  4: (()+0x79d1) [0x7fd4436b19d1]
>  5: (clone()+0x6d) [0x7fd4423ecb6d]
>
> --- begin dump of recent events ---
>      0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal
> (Aborted) **
>  in thread 7fd3bc5fe700
>
>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>  1: /usr/bin/ceph-osd() [0x9ab3b1]
>  2: (()+0xf710) [0x7fd4436b9710]
>  3: (gsignal()+0x35) [0x7fd442336925]
>  4: (abort()+0x175) [0x7fd442338105]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
>  6: (()+0xbcbe6) [0x7fd442beebe6]
>  7: (()+0xbcc13) [0x7fd442beec13]
>  8: (()+0xbcd0e) [0x7fd442beed0e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7f2) [0xaec612]
>  10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
>  11: (Pipe::connect()+0x2efb) [0xb2850b]
>  12: (Pipe::writer()+0x9f3) [0xb2a063]
>  13: (Pipe::Writer::entry()+0xd) [0xb359cd]
>  14: (()+0x79d1) [0x7fd4436b19d1]
>  15: (clone()+0x6d) [0x7fd4423ecb6d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
> in the mon:
>
> --- begin dump of recent events ---
>      0> 2014-05-20 10:37:50.378377 7ff018059700 -1 *** Caught signal
> (Aborted) **
>  in thread 7ff018059700
>
>  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>  1: /usr/bin/ceph-mon() [0x86b991]
>  2: (()+0xf710) [0x7ff01ee5b710]
>  3: (gsignal()+0x35) [0x7ff01dad8925]
>  4: (abort()+0x175) [0x7ff01dada105]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ff01e392a5d]
>  6: (()+0xbcbe6) [0x7ff01e390be6]
>  7: (()+0xbcc13) [0x7ff01e390c13]
>  8: (()+0xbcd0e) [0x7ff01e390d0e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7f2) [0x7a5472]
>  10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
>  11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
>  12: (Accepter::entry()+0x265) [0x863295]
>  13: (()+0x79d1) [0x7ff01ee539d1]
>  14: (clone()+0x6d) [0x7ff01db8eb6d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> When I create a replicated pool, I can already go to 8192 PGs without a problem.
>
> Thanks in advance!
>
> Kind regards,
> Kenneth
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

