Expanding PGs of an erasure-coded pool

Hi,

On a setup of 400 OSDs (20 nodes with 20 OSDs per node), I first tried to create an erasure-coded pool with 4096 PGs, but this crashed the cluster.
I then started over with 1024 PGs and expanded to 2048 (pg_num and pgp_num). When I then try to expand to 4096 (which is still not quite enough), the cluster crashes again. (Do we need fewer PGs with erasure coding?)
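
For reference, the commands were along these lines (the pool name "ecpool" and the "default" erasure-code profile are placeholders here):

    ceph osd pool create ecpool 1024 1024 erasure default
    ceph osd pool set ecpool pg_num 2048
    ceph osd pool set ecpool pgp_num 2048
    ceph osd pool set ecpool pg_num 4096    # <-- this is the step that crashes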

The crash starts with individual OSDs going down, which eventually brings down the mons as well (until there is no quorum left, or too few OSDs remain up).

From the logs:

    -16> 2014-05-20 10:31:55.545590 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545590, event: started, request: pg_query(0.974 epoch 3315) v3
    -15> 2014-05-20 10:31:55.545776 7fd42f34d700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9026b40
    -14> 2014-05-20 10:31:55.545807 7fd42f34d700  5 -- op tracker -- , seq: 14301, time: 2014-05-20 10:31:55.545807, event: done, request: pg_query(0.974 epoch 3315) v3
    -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700  1 -- 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0
    -12> 2014-05-20 10:31:55.564034 7fd3bf72f700  1 -- 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0
    -11> 2014-05-20 10:31:55.627776 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 ==== 47+0+0 (855262282 0 0) 0xb6863c0 con 0x1255b9c0
    -10> 2014-05-20 10:31:55.629425 7fd42df4b700  1 -- 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 ==== osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 ==== 47+0+0 (2581193378 0 0) 0x93d6c80 con 0x1255b9c0
     -9> 2014-05-20 10:31:55.631270 7fd42f34d700  1 -- 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 ==== pg_query(7.3ffs6 epoch 3326) v3 ==== 144+0+0 (221596234 0 0) 0x10b994a0 con 0x9383860
     -8> 2014-05-20 10:31:55.631308 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: pg_query(7.3ffs6 epoch 3326) v3
     -7> 2014-05-20 10:31:55.631315 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: pg_query(7.3ffs6 epoch 3326) v3
     -6> 2014-05-20 10:31:55.631339 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: pg_query(7.3ffs6 epoch 3326) v3
     -5> 2014-05-20 10:31:55.631343 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: pg_query(7.3ffs6 epoch 3326) v3
     -4> 2014-05-20 10:31:55.631349 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: pg_query(7.3ffs6 epoch 3326) v3
     -3> 2014-05-20 10:31:55.631363 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631363, event: started, request: pg_query(7.3ffs6 epoch 3326) v3
     -2> 2014-05-20 10:31:55.631402 7fd42f34d700  5 -- op tracker -- , seq: 14302, time: 2014-05-20 10:31:55.631402, event: done, request: pg_query(7.3ffs6 epoch 3326) v3
     -1> 2014-05-20 10:31:55.631488 7fd427b41700  1 -- 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860
      0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 10:31:55.630937
common/Thread.cc: 110: FAILED assert(ret == 0)

  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
  1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
  2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
  3: (Accepter::entry()+0x265) [0xb3ca45]
  4: (()+0x79d1) [0x7fd4436b19d1]
  5: (clone()+0x6d) [0x7fd4423ecb6d]

--- begin dump of recent events ---
      0> 2014-05-20 10:31:56.622247 7fd3bc5fe700 -1 *** Caught signal (Aborted) **
  in thread 7fd3bc5fe700

  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
  1: /usr/bin/ceph-osd() [0x9ab3b1]
  2: (()+0xf710) [0x7fd4436b9710]
  3: (gsignal()+0x35) [0x7fd442336925]
  4: (abort()+0x175) [0x7fd442338105]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fd442bf0a5d]
  6: (()+0xbcbe6) [0x7fd442beebe6]
  7: (()+0xbcc13) [0x7fd442beec13]
  8: (()+0xbcd0e) [0x7fd442beed0e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0xaec612]
  10: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
  11: (Pipe::connect()+0x2efb) [0xb2850b]
  12: (Pipe::writer()+0x9f3) [0xb2a063]
  13: (Pipe::Writer::entry()+0xd) [0xb359cd]
  14: (()+0x79d1) [0x7fd4436b19d1]
  15: (clone()+0x6d) [0x7fd4423ecb6d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


In the mon:

--- begin dump of recent events ---
      0> 2014-05-20 10:37:50.378377 7ff018059700 -1 *** Caught signal (Aborted) **
  in thread 7ff018059700

  ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
  1: /usr/bin/ceph-mon() [0x86b991]
  2: (()+0xf710) [0x7ff01ee5b710]
  3: (gsignal()+0x35) [0x7ff01dad8925]
  4: (abort()+0x175) [0x7ff01dada105]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ff01e392a5d]
  6: (()+0xbcbe6) [0x7ff01e390be6]
  7: (()+0xbcc13) [0x7ff01e390c13]
  8: (()+0xbcd0e) [0x7ff01e390d0e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x7a5472]
  10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
  11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
  12: (Accepter::entry()+0x265) [0x863295]
  13: (()+0x79d1) [0x7ff01ee539d1]
  14: (clone()+0x6d) [0x7ff01db8eb6d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
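
Both backtraces end in Thread::create(), i.e. pthread_create() returned non-zero, which usually means a process/thread limit was hit (in the SimpleMessenger every connection spawns extra Pipe threads, and splitting PGs triggers a burst of new peering connections). In case it is relevant, this is roughly how the limits on the nodes can be checked (a sketch using standard Linux commands, run on an OSD node):

    # max user processes/threads for the user running the daemons
    ulimit -u
    # system-wide pid/thread limits
    sysctl kernel.pid_max kernel.threads-max
    # threads currently used by the OSDs on this node
    ps -eLf | grep '[c]eph-osd' | wc -l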

When I create a replicated pool, I can already go to 8192 PGs without any problem.
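
On the PG-count question: each PG of an erasure-coded pool places k+m shards, while a replicated PG places "size" copies, so at the same pg_num an EC pool puts noticeably more PG instances (and thus more peering connections) on each OSD. A back-of-the-envelope comparison in shell, with k+m = 8 assumed purely for illustration:

    echo "EC pool,         4096 PGs: $(( 4096 * 8 / 400 )) shards per OSD"   # -> 81
    echo "replicated (x3), 8192 PGs: $(( 8192 * 3 / 400 )) copies per OSD"   # -> 61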

Thanks in advance!

Kind regards,
Kenneth


