Hi,
I installed Ceph 9.2.0 (Infernalis) on a new cluster of 3 nodes, with 50
OSDs on each node (300GB disks, 96GB RAM).
While installing, I ran into an issue where I could not even log in as the
ceph user, so I increased some limits:
security/limits.conf
ceph - nproc 1048576
ceph - nofile 1048576
I could then install the other OSDs.
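One way to confirm that such entries actually reach a running daemon is to read /proc/<pid>/limits. limits.conf is applied by pam_limits at login, so a daemon started by an init system may not inherit it. A minimal sketch, assuming Linux; the helper names parse_proc_limits and read_limits are hypothetical, not part of Ceph:

```python
import re


def parse_proc_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    # Skip the header row; columns are separated by runs of 2+ spaces.
    for line in text.splitlines()[1:]:
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid="self"):
    """Read the effective limits of a process (assumes a Linux /proc)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_proc_limits(fh.read())


if __name__ == "__main__":
    current = read_limits()  # pass the PID of a ceph-osd process instead
    for name in ("Max processes", "Max open files"):
        print(name, current.get(name))
```

If the values printed for an OSD's PID are lower than the ones in limits.conf, the daemon is not picking up those entries.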
After the cluster was installed, I added some extra pools. When creating
the PGs of these pools, the OSDs of the cluster started to fail with
stack traces. If I try to restart them, they keep failing. I don't
know if this is an actual bug in Infernalis or a limit that is still
not high enough. I've increased the nproc and nofile entries even
more, but no luck. Does anyone have a clue? Here are the stack traces I see.
Mostly this one:
-12> 2015-12-08 10:17:18.995243 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b(unlocked)] enter Initial
-11> 2015-12-08 10:17:18.995279 7fa9063c5700 5 write_log with:
dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
4294967295'18446744073709551615, trimmed:
-10> 2015-12-08 10:17:18.995292 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000048
0 0.000000
-9> 2015-12-08 10:17:18.995301 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive] enter Reset
-8> 2015-12-08 10:17:18.995310 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000008
1 0.000017
-7> 2015-12-08 10:17:18.995326 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Started
-6> 2015-12-08 10:17:18.995332 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter Start
-5> 2015-12-08 10:17:18.995338 7fa9063c5700 1 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] state<Start>:
transitioning to Primary
-4> 2015-12-08 10:17:18.995345 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] exit Start 0.000012
0 0.000000
-3> 2015-12-08 10:17:18.995352 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 inactive] enter
Started/Primary
-2> 2015-12-08 10:17:18.995358 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating] enter
Started/Primary/Peering
-1> 2015-12-08 10:17:18.995365 7fa9063c5700 5 osd.12 pg_epoch: 904
pg[3.3b( empty local-les=0 n=0 ec=904 les/c/f 0/904/0 904/904/904)
[12,80,111] r=0 lpr=904 crt=0'0 mlcod 0'0 creating+peering] enter
Started/Primary/Peering/GetInfo
0> 2015-12-08 10:17:18.998472 7fa9063c5700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7fa9063c5700 time
2015-12-08 10:17:18.995438
common/Thread.cc: 154: FAILED assert(ret == 0)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7fa91924ebe5]
2: (Thread::create(unsigned long)+0x8a) [0x7fa91923325a]
3: (SimpleMessenger::connect_rank(entity_addr_t const&, int,
PipeConnection*, Message*)+0x185) [0x7fa919229105]
4: (SimpleMessenger::get_connection(entity_inst_t const&)+0x3ba)
[0x7fa9192298ea]
5: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1ab)
[0x7fa918c7318b]
6: (OSD::do_queries(std::map<int, std::map<spg_t, pg_query_t,
std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >,
std::less<int>, std::allocator<std::pair<int const, std::map<spg_t,
pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const,
pg_query_t> > > > > >&, std::shared_ptr<OSDMap const>)+0x1f1)
[0x7fa918c9b061]
7: (OSD::dispatch_context(PG::RecoveryCtx&, PG*,
std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x142)
[0x7fa918cb5832]
8: (OSD::handle_pg_create(std::shared_ptr<OpRequest>)+0x133e)
[0x7fa918cb820e]
9: (OSD::dispatch_op(std::shared_ptr<OpRequest>)+0x220) [0x7fa918cbc0c0]
10: (OSD::do_waiters()+0x1c2) [0x7fa918cbc382]
11: (OSD::ms_dispatch(Message*)+0x227) [0x7fa918cbd727]
12: (DispatchQueue::entry()+0x649) [0x7fa91930a939]
13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fa91922eb1d]
14: (()+0x7df5) [0x7fa9172e3df5]
15: (clone()+0x6d) [0x7fa915b8c1ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Also these:
--- begin dump of recent events ---
-13> 2015-12-08 10:17:19.033845 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
Started/Primary/Peering/GetInfo 2.225918 4 0.000124
-12> 2015-12-08 10:17:19.033874 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
Started/Primary/Peering/GetLog
-11> 2015-12-08 10:17:19.033920 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
Started/Primary/Peering/GetLog 0.000046 0 0.000000
-10> 2015-12-08 10:17:19.033936 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] enter
Started/Primary/Peering/GetMissing
-9> 2015-12-08 10:17:19.033949 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
Started/Primary/Peering/GetMissing 0.000013 0 0.000000
-8> 2015-12-08 10:17:19.033962 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating+peering] exit
Started/Primary/Peering 2.226044 0 0.000000
-7> 2015-12-08 10:17:19.033975 7f409fa08700 5 osd.15 pg_epoch: 903
pg[1.d42( empty local-les=0 n=0 ec=902 les/c/f 0/902/0 902/902/902)
[15,103,82] r=0 lpr=902 crt=0'0 mlcod 0'0 creating] enter
Started/Primary/Active
-6> 2015-12-08 10:17:19.060423 7f40a4a12700 1 --
10.143.20.31:6863/8526 <== osd.94 10.143.20.32:0/13947 2 ====
osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
(3897539321 0 0) 0x7f40bffab400 con 0x7f40c3baf8c0
-5> 2015-12-08 10:17:19.060447 7f40a4a12700 1 --
10.143.20.31:6863/8526 --> 10.143.20.32:0/13947 -- osd_ping(ping_reply
e903 stamp 2015-12-08 10:17:19.059261) v2 -- ?+0 0x7f40c33f4000 con
0x7f40c3baf8c0
-4> 2015-12-08 10:17:19.060573 7f40a320f700 1 --
10.143.20.31:6862/8526 <== osd.94 10.143.20.32:0/13947 2 ====
osd_ping(ping e903 stamp 2015-12-08 10:17:19.059261) v2 ==== 47+0+0
(3897539321 0 0) 0x7f40bffab000 con 0x7f40c3bb1860
-3> 2015-12-08 10:17:19.069801 7f40a0a0a700 10 monclient: tick
-2> 2015-12-08 10:17:19.069814 7f40a0a0a700 10 monclient:
_check_auth_rotating have uptodate secrets (they expire after 2015-12-08
10:16:49.069813)
-1> 2015-12-08 10:17:19.069820 7f40a0a0a700 10 monclient: renew
subs? (now: 2015-12-08 10:17:19.069820; renew after: 2015-12-08
10:19:46.766797) -- no
0> 2015-12-08 10:17:19.121951 7f40a6215700 -1 *** Caught signal
(Aborted) **
in thread 7f40a6215700
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (()+0x7e6ab2) [0x7f40bb7aeab2]
2: (()+0xf130) [0x7f40b9940130]
3: (gsignal()+0x37) [0x7f40b81205d7]
4: (abort()+0x148) [0x7f40b8121cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f40b8a249b5]
6: (()+0x5e926) [0x7f40b8a22926]
7: (()+0x5e953) [0x7f40b8a22953]
8: (()+0x5eb73) [0x7f40b8a22b73]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0x7f40bb8a3dda]
10: (Thread::create(unsigned long)+0x8a) [0x7f40bb88825a]
11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f40bb87df0f]
12: (Accepter::entry()+0x365) [0x7f40bb941155]
13: (()+0x7df5) [0x7f40b9938df5]
14: (clone()+0x6d) [0x7f40b81e11ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
-11> 2015-12-08 10:17:19.028810 7f7c7fe48700 5 -- op tracker --
seq: 207, time: 2015-12-08 10:17:18.969336, event: dispatched, op:
osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111;
pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272;
pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303;
pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487;
pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579;
pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820;
pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179;
pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299;
pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976;
pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370;
pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
-10> 2015-12-08 10:17:19.028871 7f7c7fe48700 5 -- op tracker --
seq: 207, time: 2015-12-08 10:17:19.028871, event: wait for new map, op:
osd_pg_create(pg1.8d,902@2015-12-08 10:17:16.612111;
pg1.a6,902@2015-12-08 10:17:16.612121; pg1.f0,902@2015-12-08 10:17:16.612151;
pg1.1f5,902@2015-12-08 10:17:16.612249; pg1.232,902@2015-12-08 10:17:16.612272;
pg1.250,902@2015-12-08 10:17:16.612290; pg1.27a,902@2015-12-08 10:17:16.612303;
pg1.2f8,902@2015-12-08 10:17:16.612363; pg1.322,902@2015-12-08 10:17:16.612381;
pg1.323,902@2015-12-08 10:17:16.612381; pg1.421,902@2015-12-08 10:17:16.612487;
pg1.44d,902@2015-12-08 10:17:16.612505; pg1.55e,902@2015-12-08 10:17:16.612579;
pg1.619,902@2015-12-08 10:17:16.612642; pg1.761,902@2015-12-08 10:17:16.612764;
pg1.7f4,902@2015-12-08 10:17:16.612812; pg1.80a,902@2015-12-08 10:17:16.612820;
pg1.90d,902@2015-12-08 10:17:16.612902; pg1.d81,902@2015-12-08 10:17:16.613179;
pg1.dc3,902@2015-12-08 10:17:16.613198; pg1.df9,902@2015-12-08 10:17:16.613212;
pg1.e21,902@2015-12-08 10:17:16.613222; pg1.f4a,902@2015-12-08 10:17:16.613299;
pg1.fa8,902@2015-12-08 10:17:16.613322; pg2.61,903@2015-12-08 10:17:17.753976;
pg2.aa,903@2015-12-08 10:17:17.754024; pg2.106,903@2015-12-08 10:17:17.754086;
pg3.a6,904@2015-12-08 10:17:18.835306; pg3.113,904@2015-12-08 10:17:18.835370;
pg3.121,904@2015-12-08 10:17:18.835379; pg3.123,904@2015-12-08 10:17:18.835380; )
-9> 2015-12-08 10:17:19.028934 7f7c7fe48700 1 --
10.143.20.31:6948/26671 <== mon.0 10.143.20.31:6789/0 1046 ====
osd_map(904..904 src has 251..904) v3 ==== 671+0+0 (150079995 0 0)
0x7f7c96edfcc0 con 0x7f7c960c1340
-8> 2015-12-08 10:17:19.028936 7f7c7e645700 5 -- op tracker --
seq: 208, time: 2015-12-08 10:17:18.836032, event: header_read, op:
pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-7> 2015-12-08 10:17:19.028946 7f7c7e645700 5 -- op tracker --
seq: 208, time: 2015-12-08 10:17:18.836034, event: throttled, op:
pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-6> 2015-12-08 10:17:19.028951 7f7c7e645700 5 -- op tracker --
seq: 208, time: 2015-12-08 10:17:18.836069, event: all_read, op:
pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-5> 2015-12-08 10:17:19.028957 7f7c7e645700 5 -- op tracker --
seq: 208, time: 2015-12-08 10:17:19.028592, event: dispatched, op:
pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-4> 2015-12-08 10:17:19.028962 7f7c7e645700 5 -- op tracker --
seq: 208, time: 2015-12-08 10:17:19.028962, event: wait for new map, op:
pg_log(2.ce epoch 904 log log((0'0,0'0], crt=0'0) query_epoch 904)
-3> 2015-12-08 10:17:19.028973 7f7c7e645700 1 --
10.143.20.31:6949/26671 <== osd.79 10.143.20.32:6917/9273 4 ====
osd_map(903..903 src has 251..903) v3 ==== 1643+0+0 (496228184 0 0)
0x7f7c962e8ac0 con 0x7f7c964831e0
-2> 2015-12-08 10:17:19.029014 7f7c7fe48700 3 osd.37 903
handle_osd_map epochs [904,904], i have 903, src has [251,904]
-1> 2015-12-08 10:17:19.030416 7f7c7ae3e700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7f7c7ae3e700 time
2015-12-08 10:17:19.029219
common/Thread.cc: 154: FAILED assert(ret == 0)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f7c92cd1be5]
2: (Thread::create(unsigned long)+0x8a) [0x7f7c92cb625a]
3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0x7f7c92cabf0f]
4: (Accepter::entry()+0x365) [0x7f7c92d6f155]
5: (()+0x7df5) [0x7f7c90d66df5]
6: (clone()+0x6d) [0x7f7c8f60f1ad]
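All three traces end in the assert inside Thread::create, which wraps pthread_create. According to the pthread_create man page, a non-zero return there is typically EAGAIN, which can come from RLIMIT_NPROC but also from the kernel-wide kernel.threads-max and kernel.pid_max settings, neither of which limits.conf touches. Whether that is what happens here is an assumption, not something the logs prove. A minimal sketch, assuming Linux, that compares the current thread count of a node against those two kernel limits (the function names are hypothetical):

```python
import os


def read_int(path):
    """Read a single integer from a /proc/sys file."""
    with open(path) as fh:
        return int(fh.read().split()[0])


def count_threads(proc_root="/proc"):
    """Sum the 'Threads:' field of every process visible under /proc."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                for line in fh:
                    if line.startswith("Threads:"):
                        total += int(line.split()[1])
                        break
        except OSError:
            continue  # the process exited while we were scanning
    return total


def headroom(threads, pid_max, threads_max):
    """Threads that can still be created before the lower kernel limit."""
    return min(pid_max, threads_max) - threads


if __name__ == "__main__":
    pid_max = read_int("/proc/sys/kernel/pid_max")
    threads_max = read_int("/proc/sys/kernel/threads-max")
    threads = count_threads()
    print(f"threads={threads} pid_max={pid_max} threads_max={threads_max}")
    print(f"headroom={headroom(threads, pid_max, threads_max)}")
```

Running this on an OSD node while the PGs are being created would show whether the thread count is approaching either limit.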
Thanks for helping!
Kenneth