Hi Cephers,

At the moment we are trying to recover our Ceph cluster (0.87), which is behaving very oddly.

What has been done so far:

1. An OSD drive failure happened - Ceph marked the OSD down and out.
2. The physical HDD was replaced and NOT added to Ceph - right after the HDD was connected to the controller we had a strange kernel crash.
3. The physical host was rebooted.
4. Ceph started recovery and is now marking OSDs down one by one (I can actually see the osd process crash in the logs).

ceph.conf is attached below.

OSD failure:

    -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
     0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
     in thread 7f9434e0f700

     ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
     1: /usr/bin/ceph-osd() [0x9e2015]
     2: (()+0xfcb0) [0x7f945459fcb0]
     3: (gsignal()+0x35) [0x7f94533d30d5]
     4: (abort()+0x17b) [0x7f94533d683b]
     5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
     6: (()+0xb5846) [0x7f9453d23846]
     7: (()+0xb5873) [0x7f9453d23873]
     8: (()+0xb596e) [0x7f9453d2396e]
     9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
     10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
     11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
     12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
     13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
     14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
     15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
     16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
     17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
     18: (()+0x7e9a) [0x7f9454597e9a]
     19: (clone()+0x6d) [0x7f94534912ed]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
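The abort clearly comes out of the scrub path (SnapSet::get_clone_bytes via ReplicatedPG::_scrub), so one thing we are considering is pausing scrubbing while the cluster recovers. A minimal sketch with the standard ceph CLI flags (assuming they behave the same way on 0.87):

    # stop new scrubs / deep scrubs from being scheduled while we investigate
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # optionally keep flapping OSDs from being marked out and triggering more backfill
    ceph osd set noout

    # revert once things are stable again
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
    ceph osd unset noout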
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.27.log

Current OSD tree:

# id    weight  type name                              up/down reweight
-10     2       root ssdtree
-8      1               host ibstorage01-ssd1
9       1                       osd.9                   up      1
-9      1               host ibstorage02-ssd1
10      1                       osd.10                  up      1
-1      22.99   root default
-7      22.99           room cdsqv1
-3      22.99                   rack gopc-rack01
-2      8                               host ibstorage01-sas1
0       1                                       osd.0   down    0
1       1                                       osd.1   up      1
2       1                                       osd.2   up      1
3       1                                       osd.3   down    0
7       1                                       osd.7   up      1
4       1                                       osd.4   up      1
5       1                                       osd.5   up      1
6       1                                       osd.6   up      1
-4      6.99                            host ibstorage02-sas1
20      1                                       osd.20  down    0
21      1.03                                    osd.21  up      1
22      0.96                                    osd.22  down    0
25      1                                       osd.25  down    0
26      1                                       osd.26  up      1
27      1                                       osd.27  down    0
8       1                                       osd.8   up      1
-11     8                               host ibstorage03-sas1
11      1                                       osd.11  up      1
12      1                                       osd.12  up      1
13      1                                       osd.13  up      1
14      1                                       osd.14  up      1
15      1                                       osd.15  up      1
16      1                                       osd.16  up      1
17      1                                       osd.17  down    0
18      1                                       osd.18  up      1

The affected OSD was osd.23 on host "ibstorage02-sas1" -- it has been deleted by now.

Any thoughts / additional things to check?

Thanks!
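PS: the ops logged right before the abort all reference PG 13.77, so that PG can also be inspected directly. Roughly the queries involved (standard ceph CLI, nothing cluster-specific assumed):

    # overall cluster and health state
    ceph -s
    ceph health detail

    # detailed state of the PG referenced just before the crash
    ceph pg 13.77 query

    # PGs stuck in problematic states
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    ceph pg dump_stuck stale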
[global]
        # For version 0.55 and beyond, you must explicitly enable
        # or disable authentication with "auth" entries in [global].
        auth cluster required = none
        auth service required = none
        auth client required = none
        osd pool default size = 2
        osd pool default min size = 1
        osd recovery max active = 1
        osd deep scrub interval = 1814400
        journal queue max ops = 1000
        journal queue max bytes = 104857600
        public network = 10.10.0.0/24
        cluster network = 10.11.0.0/24
        err to syslog = true

[mon]
        mon cluster log to syslog = true

[osd]
        osd journal size = 1700
        osd mkfs type = "ext4"
        osd mkfs options ext4 = user_xattr,rw,noatime
        osd data = /srv/ceph/osd$id
        osd journal = /srv/journal/osd$id/journal
        osd crush update on start = false

[mon.1]
        host = ibstorage01
        mon addr = 10.10.0.48:6789
        mon data = /srv/mondata

[mon.2]
        host = ibstorage02
        mon addr = 10.10.0.49:6789
        mon data = /srv/mondata

[mon.3]
        host = ibstorage03
        mon addr = 10.10.0.50:6789
        mon data = /srv/mondata

[osd.0]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas1

[osd.1]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas1

[osd.2]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas1

[osd.3]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas1

[osd.4]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas2

[osd.5]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas2

[osd.6]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas2

[osd.7]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-sas2

[osd.9]
        host = ibstorage01
        public addr = 10.10.0.48
        cluster addr = 10.11.0.48
        osd crush location = host=ibstorage01-ssd1
        osd journal size = 10000
        osd data = /srv/ceph/osd9
        osd journal = /srv/ceph/osd9/journal

[osd.10]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-ssd1
        osd journal size = 10000
        osd data = /srv/ceph/osd10
        osd journal = /srv/ceph/osd10/journal

[osd.11]
        host = ibstorage03
        public addr = 10.10.0.50
        cluster addr = 10.11.0.50
        osd crush location = host=ibstorage03-sas1

[osd.12]
        host = ibstorage03
        public addr = 10.10.0.50
        cluster addr = 10.11.0.50
        osd crush location = host=ibstorage03-sas1

[osd.20]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.21]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.22]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.23]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.8]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.25]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.26]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1

[osd.27]
        host = ibstorage02
        public addr = 10.10.0.49
        cluster addr = 10.11.0.49
        osd crush location = host=ibstorage02-sas1
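One side note on the config: with "osd crush update on start = false" and an explicit "osd crush location" per OSD, a replacement OSD does not place itself in the CRUSH map on startup, so it has to be added by hand. A rough sketch of how the replaced disk would eventually go back in (device name, OSD id and weight below are placeholders, not from the actual cluster, and the exact service command depends on the init setup):

    # placeholder device /dev/sdX and id 23; paths follow the osd data / osd journal settings above
    ceph osd create                              # allocates the next free OSD id, e.g. 23
    mkfs.ext4 /dev/sdX
    mkdir -p /srv/ceph/osd23 /srv/journal/osd23
    mount -o user_xattr,rw,noatime /dev/sdX /srv/ceph/osd23
    ceph-osd -i 23 --mkfs --mkjournal            # initialize the data dir and journal
    ceph osd crush add osd.23 1.0 host=ibstorage02-sas1
    service ceph start osd.23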