Anything in dmesg/kern.log at the time this happened?
0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
I think your filesystem was somehow corrupted.
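Also, the quoted backtrace dies inside the scrub path (SnapSet::get_clone_bytes, called from PG::scrub), so pausing scrubbing may at least stop the crash loop while you investigate. A sketch using the standard `ceph osd set` flags (to be run on a node with an admin keyring; this is a mitigation idea, not a fix for the underlying assert):

```shell
# Pause all scrubbing cluster-wide while debugging the assert.
ceph osd set noscrub
ceph osd set nodeep-scrub

# Re-enable once the cluster is healthy again:
#   ceph osd unset noscrub
#   ceph osd unset nodeep-scrub
```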
And regarding this: "2. Physical HDD replaced and NOT added to CEPH - here we had strange kernel crash just after HDD connected to the controller." What are the drives connected to? We have had problems with the Intel SATA/SAS driver: you can hot-plug a drive, but if you remove one and put in another, the kernel crashes (it only happens if some time passes between those two actions, which makes it very nasty).
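To answer the dmesg question and identify the controller, something along these lines on the affected host may help (the grep patterns are just a starting point, not an exhaustive list):

```shell
# Identify the storage controller and its kernel driver:
#   lspci -nnk | grep -iA3 'sata\|sas\|raid'
# Then filter kern.log/dmesg for hotplug and link-reset noise around the
# crash. Demonstrated here on sample lines in the usual kern.log format;
# on the host, feed it /var/log/kern.log instead.
filter() { grep -icE 'ata[0-9]+|hotplug|hard resetting|i/o error'; }
printf '%s\n' \
  'Feb 26 23:20:40 host kernel: ata3.00: hard resetting link' \
  'Feb 26 23:20:41 host kernel: sd 2:0:0:0: [sdc] Attached SCSI disk' \
  | filter    # prints the number of suspicious lines (here: 1)
```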
Jan
Hi Cephers
At the moment we are trying to recover our CEPH cluster (0.87), which
is behaving very oddly.
What has been done:
1. An OSD drive failure happened - CEPH put the OSD down and out.
2. The physical HDD was replaced and NOT added to CEPH - here we had a
strange kernel crash just after the HDD was connected to the controller.
3. The physical host was rebooted.
4. CEPH started recovery and is putting OSDs down one by one
(we can actually see the osd process crash in the logs).
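To check whether all the crashing OSDs are hitting the same assert, grepping their logs for the signal line helps; a sketch (the log path follows the log_file setting quoted below - adjust if yours differs):

```shell
# On each host, list the OSD logs containing a crash:
#   grep -l 'Caught signal' /var/log/ceph/ceph-osd.*.log
# Demonstrated on a sample line matching the trace format below:
line='2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **'
printf '%s\n' "$line" | grep -c 'Caught signal'    # prints 1
```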
ceph.conf is in attachment.
OSD failure :
-4> 2016-02-26 23:20:47.906443 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-3> 2016-02-26 23:20:47.906451 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-2> 2016-02-26 23:20:47.906456 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-1> 2016-02-26 23:20:47.906462 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
in thread 7f9434e0f700
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0x9e2015]
2: (()+0xfcb0) [0x7f945459fcb0]
3: (gsignal()+0x35) [0x7f94533d30d5]
4: (abort()+0x17b) [0x7f94533d683b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
6: (()+0xb5846) [0x7f9453d23846]
7: (()+0xb5873) [0x7f9453d23873]
8: (()+0xb596e) [0x7f9453d2396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
18: (()+0x7e9a) [0x7f9454597e9a]
19: (clone()+0x6d) [0x7f94534912ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.27.log
Current OSD tree:
# id    weight  type name               up/down reweight
-10     2       root ssdtree
-8      1         host ibstorage01-ssd1
9       1           osd.9               up      1
-9      1         host ibstorage02-ssd1
10      1           osd.10              up      1
-1      22.99   root default
-7      22.99     room cdsqv1
-3      22.99       rack gopc-rack01
-2      8             host ibstorage01-sas1
0       1               osd.0           down    0
1       1               osd.1           up      1
2       1               osd.2           up      1
3       1               osd.3           down    0
7       1               osd.7           up      1
4       1               osd.4           up      1
5       1               osd.5           up      1
6       1               osd.6           up      1
-4      6.99          host ibstorage02-sas1
20      1               osd.20          down    0
21      1.03            osd.21          up      1
22      0.96            osd.22          down    0
25      1               osd.25          down    0
26      1               osd.26          up      1
27      1               osd.27          down    0
8       1               osd.8           up      1
-11     8             host ibstorage03-sas1
11      1               osd.11          up      1
12      1               osd.12          up      1
13      1               osd.13          up      1
14      1               osd.14          up      1
15      1               osd.15          up      1
16      1               osd.16          up      1
17      1               osd.17          down    0
18      1               osd.18          up      1
The affected OSD was osd.23 on the host "ibstorage02-sas1" -- deleted now.
Any thoughts / things to check additionally?
Thanks !
<ceph.conf>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com