New installation of Ceph 9.2.0: issues during install and crashing OSDs

Hi,

I installed Ceph 9.2.0 (Infernalis) on a new cluster of 3 nodes, with 50 OSDs on each node (300GB disks, 96GB RAM).

While installing, I ran into an issue where I could not even log in as the ceph user, so I increased some limits in security/limits.conf:

ceph            -       nproc           1048576
ceph            -       nofile          1048576

I could then install the other OSDs.
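
To check whether raised limits like these actually reach a running daemon, the effective values can be read from /proc/<pid>/limits. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names parse_proc_limits and read_limits are made up for illustration and are not part of Ceph. Note that limits.conf is applied through PAM sessions, so a daemon started by an init system may not pick these values up.

```python
import re


def parse_proc_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}.

    Values are kept as strings because they can be 'unlimited'.
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) >= 3:
            limits[cols[0]] = (cols[1], cols[2])
    return limits


def read_limits(pid):
    """Read and parse the limits of a running process (hypothetical helper)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_proc_limits(fh.read())
```

Calling read_limits() with the PID of a ceph-osd process and looking at the "Max processes" and "Max open files" entries would show whether the nproc and nofile settings are in effect for that process.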

After the cluster was installed, I added some extra pools. While the PGs of these pools were being created, the OSDs of the cluster started to fail with stack traces. If I try to restart them, they keep failing. I don't know whether this is an actual bug in Infernalis or a limit that is still not high enough. I've increased the nproc and nofile entries even more, but no luck. Does anyone have a clue? These are the stack traces I see:

Mostly this one:

-12> 2015-12-08 10:17:18.995243 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b(unlocked)] enter Initial
-11> 2015-12-08 10:17:18.995279 7fa9063c5700 5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 4294967295'18446744073709551615, trimmed:
-10> 2015-12-08 10:17:18.995292 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000048 0 0.000000
-9> 2015-12-08 10:17:18.995301 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] enter Reset
-8> 2015-12-08 10:17:18.995310 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000008 1 0.000017
-7> 2015-12-08 10:17:18.995326 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started
-6> 2015-12-08 10:17:18.995332 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Start
-5> 2015-12-08 10:17:18.995338 7fa9063c5700 1 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
-4> 2015-12-08 10:17:18.995345 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Start 0.000012 0 0.000000
-3> 2015-12-08 10:17:18.995352 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
-2> 2015-12-08 10:17:18.995358 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating] enter Started/Primary/Peering
-1> 2015-12-08 10:17:18.995365 7fa9063c5700 5 osd.12 pg_epoch: 904 pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904) [12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating+peering] enter Started/Primary/Peering/GetInfo
0> 2015-12-08 10:17:18.998472 7fa9063c5700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fa9063c5700 time 2015-12-08 10:17:18.995438
common/Thread.cc: 154: FAILED assert(ret == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fa91924ebe5]
 2: (Thread::create(unsigned long)+0x8a) [0x7fa91923325a]
 3: (SimpleMessenger::connect_rank(entity_addr_t const&, int, PipeConnection*, Message*)+0x185) [0x7fa919229105]
 4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x3ba) [0x7fa9192298ea]
 5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1ab) [0x7fa918c7318b]
 6: (OSD::do_queries(std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::shared_ptr<OSDMap const>)+0x1f1) [0x7fa918c9b061]
 7: (OSD::dispatch_context(PG::RecoveryCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x142) [0x7fa918cb5832]
 8: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0x133e) [0x7fa918cb820e]
 9: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x220) [0x7fa918cbc0c0]
 10: (OSD::do_waiters()+0x1c2) [0x7fa918cbc382]
 11: (OSD::ms_dispatch(Message*)+0x227) [0x7fa918cbd727]
 12: (DispatchQueue::entry()+0x649) [0x7fa91930a939]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa91922eb1d]
 14: (()+0x7df5) [0x7fa9172e3df5]
 15: (clone()+0x6d) [0x7fa915b8c1ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Also these:

--- begin dump of recent events ---
-13> 2015-12-08 10:17:19.033845 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit Started/Primary/Peering/GetInfo 2.225918 4 0.000124
-12> 2015-12-08 10:17:19.033874 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter Started/Primary/Peering/GetLog
-11> 2015-12-08 10:17:19.033920 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit Started/Primary/Peering/GetLog 0.000046 0 0.000000
-10> 2015-12-08 10:17:19.033936 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter Started/Primary/Peering/GetMissing
-9> 2015-12-08 10:17:19.033949 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit Started/Primary/Peering/GetMissing 0.000013 0 0.000000
-8> 2015-12-08 10:17:19.033962 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit Started/Primary/Peering 2.226044 0 0.000000
-7> 2015-12-08 10:17:19.033975 7f409fa08700 5 osd.15 pg_epoch: 903 pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902) [15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating] enter Started/Primary/Active
-6> 2015-12-08 10:17:19.060423 7f40a4a12700 1 -- 10.143.20.31:6863/8526 <== osd.94 10.143.20.32:0/13947 2 ==== osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0 (3897539321 0 0) 0x7f40bffab400 con 0x7f40c3baf8c0
-5> 2015-12-08 10:17:19.060447 7f40a4a12700 1 -- 10.143.20.31:6863/8526 --> 10.143.20.32:0/13947 -- osd_ping(ping_reply e903 stamp 2015-12-08 10:17:19.059261) v2 -- ?+0 0x7f40c33f4000 con 0x7f40c3baf8c0
-4> 2015-12-08 10:17:19.060573 7f40a320f700 1 -- 10.143.20.31:6862/8526 <== osd.94 10.143.20.32:0/13947 2 ==== osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0 (3897539321 0 0) 0x7f40bffab000 con 0x7f40c3bb1860
-3> 2015-12-08 10:17:19.069801 7f40a0a0a700 10 monclient: tick
-2> 2015-12-08 10:17:19.069814 7f40a0a0a700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2015-12-08 10:16:49.069813)
-1> 2015-12-08 10:17:19.069820 7f40a0a0a700 10 monclient: renew subs? (now: 2015-12-08 10:17:19.069820; renew after: 2015-12-08 10:19:46.766797) -- no
0> 2015-12-08 10:17:19.121951 7f40a6215700 -1 *** Caught signal (Aborted) **
 in thread 7f40a6215700

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (()+0x7e6ab2) [0x7f40bb7aeab2]
 2: (()+0xf130) [0x7f40b9940130]
 3: (gsignal()+0x37) [0x7f40b81205d7]
 4: (abort()+0x148) [0x7f40b8121cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f40b8a249b5]
 6: (()+0x5e926) [0x7f40b8a22926]
 7: (()+0x5e953) [0x7f40b8a22953]
 8: (()+0x5eb73) [0x7f40b8a22b73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7f40bb8a3dda]
 10: (Thread::create(unsigned long)+0x8a) [0x7f40bb88825a]
 11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f40bb87df0f]
 12: (Accepter::entry()+0x365) [0x7f40bb941155]
 13: (()+0x7df5) [0x7f40b9938df5]
 14: (clone()+0x6d) [0x7f40b81e11ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


-11> 2015-12-08 10:17:19.028810 7f7c7fe48700 5 -- op tracker -- seq: 207, time: 2015-12-08 10:17:18.969336, event: dispatched, op: osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111; pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151; pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272; pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303; pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381; pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487; pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579; pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764; pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820; pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179; pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212; pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299; pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976; pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086; pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370; pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
-10> 2015-12-08 10:17:19.028871 7f7c7fe48700 5 -- op tracker -- seq: 207, time: 2015-12-08 10:17:19.028871, event: wait for new map, op: osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111; pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151; pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272; pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303; pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381; pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487; pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579; pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764; pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820; pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179; pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212; pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299; pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976; pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086; pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370; pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
-9> 2015-12-08 10:17:19.028934 7f7c7fe48700 1 -- 10.143.20.31:6948/26671 <== mon.0 10.143.20.31:6789/0 1046 ==== osd_map(904..904 src has 251..904) v3 ==== 671+0+0 (150079995 0 0) 0x7f7c96edfcc0 con 0x7f7c960c1340
-8> 2015-12-08 10:17:19.028936 7f7c7e645700 5 -- op tracker -- seq: 208, time: 2015-12-08 10:17:18.836032, event: header_read, op: pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-7> 2015-12-08 10:17:19.028946 7f7c7e645700 5 -- op tracker -- seq: 208, time: 2015-12-08 10:17:18.836034, event: throttled, op: pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-6> 2015-12-08 10:17:19.028951 7f7c7e645700 5 -- op tracker -- seq: 208, time: 2015-12-08 10:17:18.836069, event: all_read, op: pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-5> 2015-12-08 10:17:19.028957 7f7c7e645700 5 -- op tracker -- seq: 208, time: 2015-12-08 10:17:19.028592, event: dispatched, op: pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-4> 2015-12-08 10:17:19.028962 7f7c7e645700 5 -- op tracker -- seq: 208, time: 2015-12-08 10:17:19.028962, event: wait for new map, op: pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-3> 2015-12-08 10:17:19.028973 7f7c7e645700 1 -- 10.143.20.31:6949/26671 <== osd.79 10.143.20.32:6917/9273 4 ==== osd_map(903..903 src has 251..903) v3 ==== 1643+0+0 (496228184 0 0) 0x7f7c962e8ac0 con 0x7f7c964831e0
-2> 2015-12-08 10:17:19.029014 7f7c7fe48700 3 osd.37 903 handle_osd_map epochs [904,904], i have 903, src has [251,904]
-1> 2015-12-08 10:17:19.030416 7f7c7ae3e700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f7c7ae3e700 time 2015-12-08 10:17:19.029219
common/Thread.cc: 154: FAILED assert(ret == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f7c92cd1be5]
 2: (Thread::create(unsigned long)+0x8a) [0x7f7c92cb625a]
 3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f7c92cabf0f]
 4: (Accepter::entry()+0x365) [0x7f7c92d6f155]
 5: (()+0x7df5) [0x7f7c90d66df5]
 6: (clone()+0x6d) [0x7f7c8f60f1ad]
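
All three traces end in Thread::create, called from SimpleMessenger while setting up a connection. Assuming the failed assert reflects a non-zero return from the underlying thread creation call (an assumption, not verified against the source), one thing that could be worth checking is how close the node is to the kernel-wide thread limits, which are separate from the per-user nproc limit. The sketch below is a rough illustration only, assuming a Linux /proc filesystem; all function names are hypothetical.

```python
import os
import re


def threads_in_status(status_text):
    """Return the 'Threads:' count from the text of /proc/<pid>/status, or 0."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def thread_headroom(total_threads, pid_max, threads_max):
    """Rough number of threads left before the lower of the two limits is hit."""
    return min(pid_max, threads_max) - total_threads


def count_system_threads(proc_root="/proc"):
    """Sum the thread counts of all processes currently visible in /proc."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                total += threads_in_status(fh.read())
        except OSError:
            continue  # process exited while scanning
    return total


def read_int(path):
    """Read a single integer from a /proc/sys file."""
    with open(path) as fh:
        return int(fh.read().strip())
```

On a node, thread_headroom(count_system_threads(), read_int("/proc/sys/kernel/pid_max"), read_int("/proc/sys/kernel/threads-max")) gives an approximate margin. A value near zero while the OSDs are peering would suggest a system-wide limit rather than nproc or nofile.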



Thanks for helping!

Kenneth