Looks like it's failing to create a thread: the backtraces below all end in
the FAILED assert(ret == 0) in Thread::create() (common/Thread.cc:154).
Threads count against the kernel's PID limit, so try setting kernel.pid_max
to 4194303 in /etc/sysctl.conf and apply it with "sysctl -p".

Cheers,
Brad

----- Original Message -----
> From: "Kenneth Waegeman" <kenneth.waegeman@xxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Tuesday, 8 December, 2015 10:45:11 PM
> Subject: ceph new installation of ceph 0.9.2 issue and crashing osds
>
> Hi,
>
> I installed ceph 9.2.0 on a new cluster of 3 nodes, with 50 OSDs on each
> node (300GB disks, 96GB RAM).
>
> While installing, I hit an issue where I could not even log in as the ceph
> user, so I increased some limits in security/limits.conf:
>
> ceph - nproc 1048576
> ceph - nofile 1048576
>
> I could then install the other OSDs.
>
> After the cluster was installed, I added some extra pools. When creating
> the PGs of these pools, the OSDs of the cluster started to fail, with
> stack traces. If I try to restart them, they keep on failing. I don't
> know if this is an actual bug in Infernalis, or a limit that is still
> not high enough. I've increased the nproc and nofile entries even
> more, but no luck. Does anyone have a clue?
> Here are the stack traces I see:
>
> Mostly this one:
>
>    -12> 2015-12-08 10:17:18.995243 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b(unlocked)] enter Initial
>    -11> 2015-12-08 10:17:18.995279 7fa9063c5700 5 write_log with:
> dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
> 4294967295'18446744073709551615, trimmed:
>    -10> 2015-12-08 10:17:18.995292 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000048
> 0 0.000000
>     -9> 2015-12-08 10:17:18.995301 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] enter Reset
>     -8> 2015-12-08 10:17:18.995310 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000008
> 1 0.000017
>     -7> 2015-12-08 10:17:18.995326 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started
>     -6> 2015-12-08 10:17:18.995332 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Start
>     -5> 2015-12-08 10:17:18.995338 7fa9063c5700 1 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] state<Start>:
> transitioning to Primary
>     -4> 2015-12-08 10:17:18.995345 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Start 0.000012
> 0 0.000000
>     -3> 2015-12-08 10:17:18.995352 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
>     -2> 2015-12-08 10:17:18.995358 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating] enter
> Started/Primary/Peering
>     -1> 2015-12-08 10:17:18.995365 7fa9063c5700 5 osd.12 pg_epoch: 904
> pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
> [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetInfo
>      0> 2015-12-08 10:17:18.998472 7fa9063c5700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7fa9063c5700 time
> 2015-12-08 10:17:18.995438
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
>  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7fa91924ebe5]
>  2: (Thread::create(unsigned long)+0x8a) [0x7fa91923325a]
>  3: (SimpleMessenger::connect_rank(entity_addr_t const&, int,
> PipeConnection*, Message*)+0x185) [0x7fa919229105]
>  4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x3ba)
> [0x7fa9192298ea]
>  5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1ab)
> [0x7fa918c7318b]
>  6: (OSD::do_queries(std::map<int, std::map<spg_t, pg_query_t,
> std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >,
> std::less<int>, std::allocator<std::pair<int const, std::map<spg_t,
> pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const,
> pg_query_t> > > > > >&, std::shared_ptr<OSDMap const>)+0x1f1)
> [0x7fa918c9b061]
>  7: (OSD::dispatch_context(PG::RecoveryCtx&, PG*,
> std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x142)
> [0x7fa918cb5832]
>  8: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0x133e)
> [0x7fa918cb820e]
>  9: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x220) [0x7fa918cbc0c0]
>  10: (OSD::do_waiters()+0x1c2) [0x7fa918cbc382]
>  11: (OSD::ms_dispatch(Message*)+0x227) [0x7fa918cbd727]
>  12: (DispatchQueue::entry()+0x649) [0x7fa91930a939]
>  13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa91922eb1d]
>  14: (()+0x7df5) [0x7fa9172e3df5]
>  15: (clone()+0x6d) [0x7fa915b8c1ad]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Also these:
>
> --- begin dump of recent events ---
>    -13> 2015-12-08 10:17:19.033845 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetInfo 2.225918 4 0.000124
>    -12> 2015-12-08 10:17:19.033874 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetLog
>    -11> 2015-12-08 10:17:19.033920 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetLog 0.000046 0 0.000000
>    -10> 2015-12-08 10:17:19.033936 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
> Started/Primary/Peering/GetMissing
>     -9> 2015-12-08 10:17:19.033949 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering/GetMissing 0.000013 0 0.000000
>     -8> 2015-12-08 10:17:19.033962 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
> Started/Primary/Peering 2.226044 0 0.000000
>     -7> 2015-12-08 10:17:19.033975 7f409fa08700 5 osd.15 pg_epoch: 903
> pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
> [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating] enter
> Started/Primary/Active
>     -6> 2015-12-08 10:17:19.060423 7f40a4a12700 1 --
> 10.143.20.31:6863/8526 <== osd.94 10.143.20.32:0/13947 2 ====
> osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
> (3897539321 0 0) 0x7f40bffab400 con 0x7f40c3baf8c0
>     -5> 2015-12-08 10:17:19.060447 7f40a4a12700 1 --
> 10.143.20.31:6863/8526 --> 10.143.20.32:0/13947 -- osd_ping(ping_reply
> e903 stamp 2015-12-08 10:17:19.059261) v2 -- ?+0 0x7f40c33f4000 con
> 0x7f40c3baf8c0
>     -4> 2015-12-08 10:17:19.060573 7f40a320f700 1 --
> 10.143.20.31:6862/8526 <== osd.94 10.143.20.32:0/13947 2 ====
> osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
> (3897539321 0 0) 0x7f40bffab000 con 0x7f40c3bb1860
>     -3> 2015-12-08 10:17:19.069801 7f40a0a0a700 10 monclient: tick
>     -2> 2015-12-08 10:17:19.069814 7f40a0a0a700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after 2015-12-08
> 10:16:49.069813)
>     -1> 2015-12-08 10:17:19.069820 7f40a0a0a700 10 monclient: renew
> subs? (now: 2015-12-08 10:17:19.069820; renew after: 2015-12-08
> 10:19:46.766797) -- no
>      0> 2015-12-08 10:17:19.121951 7f40a6215700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f40a6215700
>
>  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (()+0x7e6ab2) [0x7f40bb7aeab2]
>  2: (()+0xf130) [0x7f40b9940130]
>  3: (gsignal()+0x37) [0x7f40b81205d7]
>  4: (abort()+0x148) [0x7f40b8121cc8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f40b8a249b5]
>  6: (()+0x5e926) [0x7f40b8a22926]
>  7: (()+0x5e953) [0x7f40b8a22953]
>  8: (()+0x5eb73) [0x7f40b8a22b73]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0x7f40bb8a3dda]
>  10: (Thread::create(unsigned long)+0x8a) [0x7f40bb88825a]
>  11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f40bb87df0f]
>  12: (Accepter::entry()+0x365) [0x7f40bb941155]
>  13: (()+0x7df5) [0x7f40b9938df5]
>  14: (clone()+0x6d) [0x7f40b81e11ad]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
>
>    -11> 2015-12-08 10:17:19.028810 7f7c7fe48700 5 -- op tracker --
> seq: 207, time: 2015-12-08 10:17:18.969336, event: dispatched, op:
> osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111; pg1.a6,902@2015-12-08
> 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
> pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08
> 10:17:16.612272; pg1.250,902@2015-12-08 10:17:16.612290;
> pg1.27a,902@2015-12-08 10:17:16.612303; pg1.2f8,902@2015-12-08
> 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
> pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08
> 10:17:16.612487; pg1.44d,902@2015-12-08 10:17:16.612505;
> pg1.55e,902@2015-12-08 10:17:16.612579; pg1.619,902@2015-12-08
> 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
> pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08
> 10:17:16.612820; pg1.90d,902@2015-12-08 10:17:16.612902;
> pg1.d81,902@2015-12-08 10:17:16.613179; pg1.dc3,902@2015-12-08
> 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
> pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08
> 10:17:16.613299; pg1.fa8,902@2015-12-08 10:17:16.613322;
> pg2.61,903@2015-12-08 10:17:17.753976; pg2.aa,903@2015-12-08
> 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
> pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08
> 10:17:18.835370; pg3.121,904@2015-12-08 10:17:18.835379;
> pg3.123,904@2015-12-08 10:17:18.835380; )
>    -10> 2015-12-08 10:17:19.028871 7f7c7fe48700 5 -- op tracker --
> seq: 207, time: 2015-12-08 10:17:19.028871, event: wait for new map, op:
> osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111; pg1.a6,902@2015-12-08
> 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
> pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08
> 10:17:16.612272; pg1.250,902@2015-12-08 10:17:16.612290;
> pg1.27a,902@2015-12-08 10:17:16.612303; pg1.2f8,902@2015-12-08
> 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
> pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08
> 10:17:16.612487; pg1.44d,902@2015-12-08 10:17:16.612505;
> pg1.55e,902@2015-12-08 10:17:16.612579; pg1.619,902@2015-12-08
> 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
> pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08
> 10:17:16.612820; pg1.90d,902@2015-12-08 10:17:16.612902;
> pg1.d81,902@2015-12-08 10:17:16.613179; pg1.dc3,902@2015-12-08
> 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
> pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08
> 10:17:16.613299; pg1.fa8,902@2015-12-08 10:17:16.613322;
> pg2.61,903@2015-12-08 10:17:17.753976; pg2.aa,903@2015-12-08
> 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
> pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08
> 10:17:18.835370; pg3.121,904@2015-12-08 10:17:18.835379;
> pg3.123,904@2015-12-08 10:17:18.835380; )
>     -9> 2015-12-08 10:17:19.028934 7f7c7fe48700 1 --
> 10.143.20.31:6948/26671 <== mon.0 10.143.20.31:6789/0 1046 ====
> osd_map(904..904 src has 251..904) v3 ==== 671+0+0 (150079995 0 0)
> 0x7f7c96edfcc0 con 0x7f7c960c1340
>     -8> 2015-12-08 10:17:19.028936 7f7c7e645700 5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836032, event: header_read, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>     -7> 2015-12-08 10:17:19.028946 7f7c7e645700 5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836034, event: throttled, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>     -6> 2015-12-08 10:17:19.028951 7f7c7e645700 5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:18.836069, event: all_read, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>     -5> 2015-12-08 10:17:19.028957 7f7c7e645700 5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:19.028592, event: dispatched, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>     -4> 2015-12-08 10:17:19.028962 7f7c7e645700 5 -- op tracker --
> seq: 208, time: 2015-12-08 10:17:19.028962, event: wait for new map, op:
> pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
>     -3> 2015-12-08 10:17:19.028973 7f7c7e645700 1 --
> 10.143.20.31:6949/26671 <== osd.79 10.143.20.32:6917/9273 4 ====
> osd_map(903..903 src has 251..903) v3 ==== 1643+0+0 (496228184 0 0)
> 0x7f7c962e8ac0 con 0x7f7c964831e0
>     -2> 2015-12-08 10:17:19.029014 7f7c7fe48700 3 osd.37 903
> handle_osd_map epochs [904,904], i have 903, src has [251,904]
>     -1> 2015-12-08 10:17:19.030416 7f7c7ae3e700 -1 common/Thread.cc: In
> function 'void Thread::create(size_t)' thread 7f7c7ae3e700 time
> 2015-12-08 10:17:19.029219
> common/Thread.cc: 154: FAILED assert(ret == 0)
>
>  ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7f7c92cd1be5]
>  2: (Thread::create(unsigned long)+0x8a) [0x7f7c92cb625a]
>  3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f7c92cabf0f]
>  4: (Accepter::entry()+0x365) [0x7f7c92d6f155]
>  5: (()+0x7df5) [0x7f7c90d66df5]
>  6: (clone()+0x6d) [0x7f7c8f60f1ad]
>
>
>
> Thanks for helping!
>
> Kenneth
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
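
P.S. A rough way to check whether the pid_max suggestion at the top applies to a node: thread IDs are drawn from the same number space as PIDs, so the total thread count on the host is what runs into kernel.pid_max (and kernel.threads-max). The sketch below is not from the original thread; it assumes a Linux host with /proc mounted, and the function names (read_int, count_threads, headroom) are made up for illustration.

```python
from pathlib import Path


def read_int(path):
    """Read the first integer found in a /proc-style file."""
    return int(Path(path).read_text().split()[0])


def count_threads(proc_root="/proc"):
    """Sum the 'Threads:' field of every <proc_root>/<pid>/status file."""
    total = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as 'sys' or 'self'
        try:
            for line in (entry / "status").read_text().splitlines():
                if line.startswith("Threads:"):
                    total += int(line.split()[1])
                    break
        except (OSError, ValueError):
            continue  # process exited or status was unreadable; ignore it
    return total


def headroom(proc_root="/proc"):
    """Compare threads in use with the smaller of pid_max and threads-max."""
    root = Path(proc_root)
    pid_max = read_int(root / "sys" / "kernel" / "pid_max")
    threads_max = read_int(root / "sys" / "kernel" / "threads-max")
    in_use = count_threads(root)
    limit = min(pid_max, threads_max)
    return {
        "pid_max": pid_max,
        "threads_max": threads_max,
        "threads_in_use": in_use,
        "limit": limit,
        "free": limit - in_use,
    }
```

Running print(headroom()) on an OSD node while the PGs are being created should show whether "free" is getting close to zero. If it is, raising kernel.pid_max as suggested is the likely fix; if it is not, some other limit is probably involved.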