Ok, have further narrowed this down, when running the OSD in the foreground with “usr/bin/ceph-osd -i 35 --pid-file /var/run/ceph/osd.35.pid -c /etc/ceph/ceph.conf --cluster ceph –f” I see the following, it looks like I’m now chasing down 'ceph::buffer::malformed_input' and ‘PGLog::read_log’ [root@cephosd4 ~]# /usr/bin/ceph-osd -i 35 --pid-file /var/run/ceph/osd.35.pid -c /etc/ceph/ceph.conf --cluster ceph -f starting osd.35 at :/0 osd_data /var/lib/ceph/osd/ceph-35 /var/lib/ceph/osd/ceph-35/journal terminate called after throwing an instance of 'ceph::buffer::malformed_input' what(): buffer::malformed_input: bad checksum on pg_log_entry_t *** Caught signal (Aborted) ** in thread 7f2eeaef7880 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 1: /usr/bin/ceph-osd() [0xac5c32] 2: (()+0xf130) [0x7f2ee9885130] 3: (gsignal()+0x37) [0x7f2ee829f5d7] 4: (abort()+0x148) [0x7f2ee82a0cc8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5] 6: (()+0x5e926) [0x7f2ee8ba1926] 7: (()+0x5e953) [0x7f2ee8ba1953] 8: (()+0x5eb73) [0x7f2ee8ba1b73] 9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230) [0x792510] 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xa2f) [0x76b9bf] 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf] 12: (OSD::load_pgs()+0xa7a) [0x6b6a8a] 13: (OSD::init()+0x729) [0x6b9319] 14: (main()+0x27f3) [0x643ed3] 15: (__libc_start_main()+0xf5) [0x7f2ee828baf5] 16: /usr/bin/ceph-osd() [0x65d139] 2016-04-05 12:00:29.990938 7f2eeaef7880 -1 *** Caught signal (Aborted) ** in thread 7f2eeaef7880 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 1: /usr/bin/ceph-osd() [0xac5c32] 2: (()+0xf130) [0x7f2ee9885130] 3: (gsignal()+0x37) [0x7f2ee829f5d7] 4: (abort()+0x148) [0x7f2ee82a0cc8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5] 6: (()+0x5e926) [0x7f2ee8ba1926] 7: (()+0x5e953) [0x7f2ee8ba1953] 8: (()+0x5eb73) [0x7f2ee8ba1b73] 9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230) [0x792510] 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xa2f) [0x76b9bf] 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf] 12: (OSD::load_pgs()+0xa7a) [0x6b6a8a] 13: (OSD::init()+0x729) [0x6b9319] 14: (main()+0x27f3) [0x643ed3] 15: (__libc_start_main()+0xf5) [0x7f2ee828baf5] 16: /usr/bin/ceph-osd() [0x65d139] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 0> 2016-04-05 12:00:29.990938 7f2eeaef7880 -1 *** Caught signal (Aborted) ** in thread 7f2eeaef7880 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 1: /usr/bin/ceph-osd() [0xac5c32] 2: (()+0xf130) [0x7f2ee9885130] 3: (gsignal()+0x37) [0x7f2ee829f5d7] 4: (abort()+0x148) [0x7f2ee82a0cc8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5] 6: (()+0x5e926) [0x7f2ee8ba1926] 7: (()+0x5e953) [0x7f2ee8ba1953] 8: (()+0x5eb73) [0x7f2ee8ba1b73] 9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230) [0x792510] 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xa2f) [0x76b9bf] 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf] 12: (OSD::load_pgs()+0xa7a) [0x6b6a8a] 13: (OSD::init()+0x729) [0x6b9319] 14: (main()+0x27f3) [0x643ed3] 15: (__libc_start_main()+0xf5) [0x7f2ee828baf5] 16: /usr/bin/ceph-osd() [0x65d139] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Aborted [root@cephosd4 ~]# From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf Of Cory Hawkless Ok, we seem to have traced the issue down to this “ -4> 2016-04-05 00:59:37.790223 7fbde47a0880 20 read_log 1270'2332 (1270'911) delete 50686c09/rb.0.6f57.2506ccfd.00000000bb2a/head//34 by client.95981.0:47944 2015-10-13 16:47:23.620857 -3> 2016-04-05 00:59:37.790236 7fbde47a0880 20 read_log 1270'2333 (1270'917) delete 7c7a1409/rb.0.6f57.2506ccfd.00000000bcc6/head//34 by client.95981.0:48356 2015-10-13 16:47:24.523262 -2> 2016-04-05 00:59:37.790256 7fbde47a0880 20 read_log 1270'2334 (1270'983) delete 483dbe09/rb.0.6f57.2506ccfd.00000000c9e9/head//34 by client.95981.0:51720 2015-10-13 16:47:31.584863 -1> 2016-04-05 00:59:37.790269 7fbde47a0880 20 read_log 1270'2335 (1270'985) delete 85411809/rb.0.6f57.2506ccfd.00000000cd66/head//34 by client.95981.0:52614 2015-10-13 16:47:33.278702 0> 2016-04-05 00:59:37.793431 7fbde47a0880 -1 *** Caught signal (Aborted) ** in thread 7fbde47a0880 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 1: /usr/bin/ceph-osd() [0xac5c32] 2: (()+0xf130) [0x7fbde312e130] 3: (gsignal()+0x37) [0x7fbde1b485d7] 4: (abort()+0x148) [0x7fbde1b49cc8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [“ Here is a link to the full log of the OSD starting with debugging enabled https://www.dropbox.com/s/i1x2vqcgp7qkr96/ceph-osd.35-3.log?dl=0 From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf Of Cory Hawkless I’ve moved a whole set of OSD’s from one server to another after a filure. All other OSD’s have come up no problems and Ceph is busy rebalancing itself after the change, but one of my OSD’s is being difficult. Below is the output from the log file when I use ‘service ceph restart osd.35’ to try and bring it online. I can see any usefull errors here to start chasing down, can you? -32> 2016-04-04 16:07:13.425771 7f28d457f880 2 journal read_entry 4048896000 : seq 283263182 137 bytes -31> 2016-04-04 16:07:13.425780 7f28d457f880 2 journal read_entry 4048900096 : seq 283263183 137 bytes -30> 2016-04-04 16:07:13.425791 7f28d457f880 2 journal No further valid entries found, journal is most likely valid -29> 2016-04-04 16:07:13.425802 7f28d457f880 2 journal No further valid entries found, journal is most likely valid -28> 2016-04-04 16:07:13.425809 7f28d457f880 3 journal journal_replay: end of journal, done. -27> 2016-04-04 16:07:13.431977 7f28d457f880 1 journal _open /var/lib/ceph/osd/ceph-35/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1 -26> 2016-04-04 16:07:13.432317 7f28d457f880 2 osd.35 0 boot -25> 2016-04-04 16:07:13.462245 7f28d457f880 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello -24> 2016-04-04 16:07:13.463479 7f28d457f880 1 <cls> cls/log/cls_log.cc:312: Loaded log class! -23> 2016-04-04 16:07:13.468017 7f28d457f880 1 <cls> cls/refcount/cls_refcount.cc:231: Loaded refcount class! -22> 2016-04-04 16:07:13.468228 7f28d457f880 1 <cls> cls/replica_log/cls_replica_log.cc:141: Loaded replica log class! -21> 2016-04-04 16:07:13.472035 7f28d457f880 1 <cls> cls/rgw/cls_rgw.cc:3047: Loaded rgw class! -20> 2016-04-04 16:07:13.472480 7f28d457f880 1 <cls> cls/statelog/cls_statelog.cc:306: Loaded log class! -19> 2016-04-04 16:07:13.472890 7f28d457f880 1 <cls> cls/user/cls_user.cc:367: Loaded user class! -18> 2016-04-04 16:07:13.473231 7f28d457f880 1 <cls> cls/version/cls_version.cc:227: Loaded version class! -17> 2016-04-04 16:07:13.565117 7f28d457f880 0 osd.35 3612 crush map has features 1107558400, adjusting msgr requires for clients -16> 2016-04-04 16:07:13.565139 7f28d457f880 0 osd.35 3612 crush map has features 1107558400 was 8705, adjusting msgr requires for mons -15> 2016-04-04 16:07:13.565149 7f28d457f880 0 osd.35 3612 crush map has features 1107558400, adjusting msgr requires for osds -14> 2016-04-04 16:07:13.565171 7f28d457f880 0 osd.35 3612 load_pgs -13> 2016-04-04 16:07:36.965045 7f28d457f880 5 osd.35 pg_epoch: 3604 pg[34.2(unlocked)] enter Initial -12> 2016-04-04 16:07:36.965642 7f28d457f880 5 osd.35 pg_epoch: 3604 pg[34.2( empty local-les=0 n=0 ec=709 les/c 3578/3582 3588/3588/3542) [38,35] r=1 lpr=0 pi=3542-3587/1 crt=0'0 inactive NOTIFY] exit Initial 0.000598 0 0.000000 -11> 2016-04-04 16:07:36.965691 7f28d457f880 5 osd.35 pg_epoch: 3604 pg[34.2( empty local-les=0 n=0 ec=709 les/c 3578/3582 3588/3588/3542) [38,35] r=1 lpr=0 pi=3542-3587/1 crt=0'0 inactive NOTIFY] enter Reset -10> 2016-04-04 16:07:36.987602 7f28d457f880 5 osd.35 pg_epoch: 3587 pg[34.3(unlocked)] enter Initial -9> 2016-04-04 16:07:37.011658 7f28d457f880 5 osd.35 pg_epoch: 3587 pg[34.3( v 3334'2172 (0'0,3334'2172] local-les=3562 n=0 ec=709 les/c 3562/3587 3516/3516/3459) [35,34] r=0 lpr=0 crt=3334'2170 lcod 0'0 mlcod 0'0 inactive] exit Initial
0.024056 0 0.000000 -8> 2016-04-04 16:07:37.011696 7f28d457f880 5 osd.35 pg_epoch: 3587 pg[34.3( v 3334'2172 (0'0,3334'2172] local-les=3562 n=0 ec=709 les/c 3562/3587 3516/3516/3459) [35,34] r=0 lpr=0 crt=3334'2170 lcod 0'0 mlcod 0'0 inactive] enter Reset -7> 2016-04-04 16:07:37.067070 7f28d457f880 5 osd.35 pg_epoch: 3481 pg[34.7(unlocked)] enter Initial -6> 2016-04-04 16:07:37.097469 7f28d457f880 5 osd.35 pg_epoch: 3481 pg[34.7( v 3334'2635 (0'0,3334'2635] local-les=3465 n=0 ec=709 les/c 3465/3479 3459/3459/3369) [34,35] r=1 lpr=0 pi=3448-3458/1 crt=3334'2635 lcod 0'0 inactive NOTIFY]
exit Initial 0.030398 0 0.000000 -5> 2016-04-04 16:07:37.097531 7f28d457f880 5 osd.35 pg_epoch: 3481 pg[34.7( v 3334'2635 (0'0,3334'2635] local-les=3465 n=0 ec=709 les/c 3465/3479 3459/3459/3369) [34,35] r=1 lpr=0 pi=3448-3458/1 crt=3334'2635 lcod 0'0 inactive NOTIFY]
enter Reset -4> 2016-04-04 16:07:37.146741 7f28d457f880 5 osd.35 pg_epoch: 3582 pg[34.8(unlocked)] enter Initial -3> 2016-04-04 16:07:37.172346 7f28d457f880 5 osd.35 pg_epoch: 3582 pg[34.8( v 3334'2042 (0'0,3334'2042] lb 0//0//-1 local-les=3494 n=0 ec=709 les/c 3494/3502 3542/3542/3368) [33,38] r=-1 lpr=0 pi=3368-3541/4 crt=3334'2042 lcod 0'0
inactive NOTIFY] exit Initial 0.025604 0 0.000000 -2> 2016-04-04 16:07:37.172387 7f28d457f880 5 osd.35 pg_epoch: 3582 pg[34.8( v 3334'2042 (0'0,3334'2042] lb 0//0//-1 local-les=3494 n=0 ec=709 les/c 3494/3502 3542/3542/3368) [33,38] r=-1 lpr=0 pi=3368-3541/4 crt=3334'2042 lcod 0'0
inactive NOTIFY] enter Reset -1> 2016-04-04 16:07:37.189150 7f28d457f880 5 osd.35 pg_epoch: 3585 pg[34.9(unlocked)] enter Initial 0> 2016-04-04 16:07:37.268571 7f28d457f880 -1 *** Caught signal (Aborted) ** in thread 7f28d457f880 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b) 1: /usr/bin/ceph-osd() [0xac5c32] 2: (()+0xf130) [0x7f28d2f0d130] 3: (gsignal()+0x37) [0x7f28d19275d7] 4: (abort()+0x148) [0x7f28d1928cc8] 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f28d222b9b5] 6: (()+0x5e926) [0x7f28d2229926] 7: (()+0x5e953) [0x7f28d2229953] 8: (()+0x5eb73) [0x7f28d2229b73] 9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230) [0x792510] 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char,
std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0xa2f) [0x76b9bf] 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf] 12: (OSD::load_pgs()+0xa7a) [0x6b6a8a] 13: (OSD::init()+0x729) [0x6b9319] 14: (main()+0x27f3) [0x643ed3] 15: (__libc_start_main()+0xf5) [0x7f28d1913af5] 16: /usr/bin/ceph-osd() [0x65d139] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 10000 max_new 1000 log_file /var/log/ceph/ceph-osd.35.log --- end dump of recent events --- [root@cephosd4 ~]# |
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com