Hi Jan,
Yes, that's exactly the case - the filesystem on the OSDs was corrupted, but the controller is not Intel SATA/SAS.
Hardware: 3x Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05), driven by mpt2sas.
In our case, however, the kernel crash came almost at once after the HDD was connected, not after a delay.
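Commands along these lines can be used to confirm the controller model and the mpt2sas driver version on a given host (output will of course differ per box):

    # list SAS/SATA controllers on the PCI bus
    lspci | grep -iE "sas|sata"
    # version of the mpt2sas module available to the kernel
    modinfo mpt2sas | grep -i version
    # any messages the controller driver logged at boot or on hotplug
    dmesg | grep -i mpt2sas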
Max
On 27/02/2016 11:54 PM, Jan Schermer wrote:
Anything in dmesg/kern.log at the time this happened?

 0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
I think your filesystem was somehow corrupted.
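Something like this should show whether the kernel logged anything around that timestamp (the path assumes a Debian/Ubuntu-style kern.log; adjust for your distro):

    # pull the minute around the crash from the kernel log
    grep "Feb 26 23:20" /var/log/kern.log
    # or, if the host has not rebooted since, check the ring buffer directly
    dmesg | grep -iE "error|fail|sd[a-z]"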
And regarding this: "2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller."
What are the drives connected to? We have had problems with the Intel SATA/SAS driver: you can hotplug a drive, but if you remove one and put in another, the kernel crashes. It only happens if some time passes between those two actions, which makes it very nasty.
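A removal/insert sequence along these lines may avoid that race, since the kernel is told about the swap instead of discovering it (sdX and host0 are placeholders for the actual device and SCSI host):

    # take the old disk offline and delete it from the SCSI layer before pulling it
    echo offline > /sys/block/sdX/device/state
    echo 1 > /sys/block/sdX/device/delete
    # after inserting the replacement, rescan the controller's SCSI host
    echo "- - -" > /sys/class/scsi_host/host0/scan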
Jan
Hi Cephers
At the moment we are trying to recover our CEPH cluster (0.87), which is behaving very oddly.
What has been done:
1. OSD drive failure happened - CEPH put the OSD down and out.
2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller.
3. Physical host rebooted.
4. CEPH started restoration and is putting OSDs down one by one (I can actually see the osd process crash in the logs; a way to pause this while investigating is sketched below).
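While the crash is being investigated, flags roughly like these can stop the cluster from digging itself deeper (the backtrace below points at the scrub path, so the scrub flags in particular may keep OSDs from aborting; adjust to your situation):

    ceph osd set noout          # don't mark down OSDs out and rebalance away from them
    ceph osd set noscrub        # pause regular scrubbing
    ceph osd set nodeep-scrub   # pause deep scrubbing
    ceph -s                     # then watch the cluster state
    ceph health detail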
ceph.conf is attached.
OSD failure:

 -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
  0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
 in thread 7f9434e0f700

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: /usr/bin/ceph-osd() [0x9e2015]
 2: (()+0xfcb0) [0x7f945459fcb0]
 3: (gsignal()+0x35) [0x7f94533d30d5]
 4: (abort()+0x17b) [0x7f94533d683b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
 6: (()+0xb5846) [0x7f9453d23846]
 7: (()+0xb5873) [0x7f9453d23873]
 8: (()+0xb596e) [0x7f9453d2396e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
 10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
 11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
 12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
 18: (()+0x7e9a) [0x7f9454597e9a]
 19: (clone()+0x6d) [0x7f94534912ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.27.log
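Regarding the NOTE in the dump above: something like this can resolve the backtrace offsets, assuming the matching debug symbols for 0.87 are installed (otherwise only raw addresses come back):

    # disassemble the exact binary that crashed, with source interleaved
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm
    # or resolve a single frame from the backtrace, e.g. frame 1 at 0x9e2015
    addr2line -Cfe /usr/bin/ceph-osd 0x9e2015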
Current OSD tree:
# id   weight  type name                    up/down  reweight
-10    2       root ssdtree
-8     1         host ibstorage01-ssd1
9      1           osd.9                    up       1
-9     1         host ibstorage02-ssd1
10     1           osd.10                   up       1
-1     22.99   root default
-7     22.99     room cdsqv1
-3     22.99       rack gopc-rack01
-2     8             host ibstorage01-sas1
0      1               osd.0                down     0
1      1               osd.1                up       1
2      1               osd.2                up       1
3      1               osd.3                down     0
7      1               osd.7                up       1
4      1               osd.4                up       1
5      1               osd.5                up       1
6      1               osd.6                up       1
-4     6.99          host ibstorage02-sas1
20     1               osd.20               down     0
21     1.03            osd.21               up       1
22     0.96            osd.22               down     0
25     1               osd.25               down     0
26     1               osd.26               up       1
27     1               osd.27               down     0
8      1               osd.8                up       1
-11    8             host ibstorage03-sas1
11     1               osd.11               up       1
12     1               osd.12               up       1
13     1               osd.13               up       1
14     1               osd.14               up       1
15     1               osd.15               up       1
16     1               osd.16               up       1
17     1               osd.17               down     0
18     1               osd.18               up       1
The affected OSD was osd.23 on host "ibstorage02-sas1" - it has been deleted now.
Any thoughts / things to check additionally?
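If any of the other down OSDs end up needing the same treatment as osd.23, the usual removal sequence is roughly this (a sketch; substitute the right ID and stop the ceph-osd daemon on the host first):

    ceph osd out 23                 # make sure it is marked out
    ceph osd crush remove osd.23    # drop it from the CRUSH map
    ceph auth del osd.23            # remove its cephx key
    ceph osd rm 23                  # remove it from the OSD map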
Thanks!
<ceph.conf>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com