Hi Cephers,

At the moment we are trying to recover our Ceph cluster (0.87), which is behaving very oddly. What has been done so far:

1. An OSD drive failure happened - Ceph marked the OSD down and out.
2. The physical HDD was replaced but NOT added to Ceph - here we had a strange kernel crash right after the HDD was connected to the controller.
3. The physical host was rebooted.
4. Ceph started recovery and is now taking OSDs down one by one (I can actually see the osd process crashing in the logs).

ceph.conf is attached.

OSD failure:

    -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
     0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
    in thread 7f9434e0f700

    ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
    1: /usr/bin/ceph-osd() [0x9e2015]
    2: (()+0xfcb0) [0x7f945459fcb0]
    3: (gsignal()+0x35) [0x7f94533d30d5]
    4: (abort()+0x17b) [0x7f94533d683b]
    5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
    6: (()+0xb5846) [0x7f9453d23846]
    7: (()+0xb5873) [0x7f9453d23873]
    8: (()+0xb596e) [0x7f9453d2396e]
    9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
    10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
    11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
    12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
    13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
    14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
    15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
    16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
    17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
    18: (()+0x7e9a) [0x7f9454597e9a]
    19: (clone()+0x6d) [0x7f94534912ed]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
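From the backtrace, the abort is the assert in SnapSet::get_clone_bytes(), hit from ReplicatedPG::_scrub() while the PG is being scrubbed during backfill. As a first mitigation we are thinking of pausing scrubbing and stopping further automatic out-marking while recovery runs - just a rough sketch of the commands we have in mind (run from a monitor node, flags to be unset again once the cluster is healthy):

    # pause scrubbing cluster-wide so the failing assert is not re-triggered
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # keep crashed/rebooted OSDs from being marked out again while we investigate
    ceph osd set noout

    # to revert later:
    # ceph osd unset noscrub; ceph osd unset nodeep-scrub; ceph osd unset noout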
Logging levels in effect at the time of the crash (from the same log):

    --- logging levels ---
       0/ 5 none
       0/ 1 lockdep
       0/ 1 context
       1/ 1 crush
       1/ 5 mds
       1/ 5 mds_balancer
       1/ 5 mds_locker
       1/ 5 mds_log
       1/ 5 mds_log_expire
       1/ 5 mds_migrator
       0/ 1 buffer
       0/ 1 timer
       0/ 1 filer
       0/ 1 striper
       0/ 1 objecter
       0/ 5 rados
       0/ 5 rbd
       0/ 5 rbd_replay
       0/ 5 journaler
       0/ 5 objectcacher
       0/ 5 client
       0/ 5 osd
       0/ 5 optracker
       0/ 5 objclass
       1/ 3 filestore
       1/ 3 keyvaluestore
       1/ 3 journal
       0/ 5 ms
       1/ 5 mon
       0/10 monc
       1/ 5 paxos
       0/ 5 tp
       1/ 5 auth
       1/ 5 crypto
       1/ 1 finisher
       1/ 5 heartbeatmap
       1/ 5 perfcounter
       1/ 5 rgw
       1/10 civetweb
       1/ 5 javaclient
       1/ 5 asok
       1/ 1 throttle
       0/ 0 refs
      -1/-1 (syslog threshold)
      -1/-1 (stderr threshold)
      max_recent    10000
      max_new        1000
      log_file /var/log/ceph/ceph-osd.27.log

Current OSD tree:

    # id    weight  type name                       up/down reweight
    -10     2       root ssdtree
    -8      1           host ibstorage01-ssd1
    9       1               osd.9                   up      1
    -9      1           host ibstorage02-ssd1
    10      1               osd.10                  up      1
    -1      22.99   root default
    -7      22.99       room cdsqv1
    -3      22.99           rack gopc-rack01
    -2      8                   host ibstorage01-sas1
    0       1                       osd.0           down    0
    1       1                       osd.1           up      1
    2       1                       osd.2           up      1
    3       1                       osd.3           down    0
    7       1                       osd.7           up      1
    4       1                       osd.4           up      1
    5       1                       osd.5           up      1
    6       1                       osd.6           up      1
    -4      6.99                host ibstorage02-sas1
    20      1                       osd.20          down    0
    21      1.03                    osd.21          up      1
    22      0.96                    osd.22          down    0
    25      1                       osd.25          down    0
    26      1                       osd.26          up      1
    27      1                       osd.27          down    0
    8       1                       osd.8           up      1
    -11     8                   host ibstorage03-sas1
    11      1                       osd.11          up      1
    12      1                       osd.12          up      1
    13      1                       osd.13          up      1
    14      1                       osd.14          up      1
    15      1                       osd.15          up      1
    16      1                       osd.16          up      1
    17      1                       osd.17          down    0
    18      1                       osd.18          up      1

The affected OSD was osd.23 on the host "ibstorage02-sas1" -- it has been deleted now.

Any thoughts / additional things to check?

Thanks!
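PS: for completeness, this is roughly how we are double-checking that the deleted osd.23 left nothing behind and which PGs are still unhealthy - just a sketch of standard commands, run on a monitor host:

    # confirm osd.23 is really gone from the CRUSH map and the auth database
    ceph osd tree | grep 'osd\.23'
    ceph auth list | grep 'osd\.23'

    # if anything still shows up, remove the leftovers completely
    ceph osd crush remove osd.23
    ceph auth del osd.23
    ceph osd rm 23

    # see which PGs are still stuck after the OSD crashes
    ceph health detail
    ceph pg dump_stuck unclean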