Re: Old CEPH (0.87) cluster degradation - putting OSDs down one by one

Anything in dmesg/kern.log at the time this happened?

     0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **

I think your filesystem was somehow corrupted.
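If you still have the logs, it is worth pulling the kernel messages from right around that timestamp. A rough sketch, assuming a standard Debian/Ubuntu layout with /var/log/kern.log (adjust paths and the date format to your distribution; older util-linux may not support dmesg -T):

    # kernel ring buffer with readable timestamps (only useful if the host was not rebooted since)
    dmesg -T | less
    # persistent kernel log around the crash time
    grep 'Feb 26 23:2' /var/log/kern.log
    # filesystem / disk errors are the interesting part
    grep -i -E 'xfs|ext4|ata|scsi|i/o error' /var/log/kern.log | tail -n 100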

And regarding this: "2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller."
What are the drives connected to? We have had problems with the Intel SATA/SAS driver: you can hotplug a drive, but if you remove one and put in another, the kernel crashes (it only happens if some time passes between those two actions, which makes it very nasty).
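To see which controller and kernel driver the data drives actually sit behind, something along these lines should do (just a sketch; the grep patterns and device names are examples):

    # PCI SATA/SAS/RAID controllers and the kernel driver bound to each
    lspci -k | grep -A 3 -i -E 'sata|sas|raid'
    # sysfs paths show which controller (PCI address) each disk hangs off
    ls -l /sys/block/sd*
    # recent SCSI/ATA hotplug events
    dmesg | grep -i -E 'hotplug|scsi|ata' | tail -n 50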

Jan



On 27 Feb 2016, at 00:14, maxxik <maxxik@xxxxxxxxx> wrote:

Hi Cephers

At the moment we are trying to recover our CEPH cluster (0.87), which is behaving very oddly.

What has been done:

1. An OSD drive failure happened - CEPH put the OSD down and out.
2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller.
3. Physical host rebooted.
4. CEPH started restoration and is putting OSDs down one by one (I can actually see the osd process crash in the logs).

ceph.conf is attached.


OSD failure :

    -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
    -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
     0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
 in thread 7f9434e0f700

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: /usr/bin/ceph-osd() [0x9e2015]
 2: (()+0xfcb0) [0x7f945459fcb0]
 3: (gsignal()+0x35) [0x7f94533d30d5]
 4: (abort()+0x17b) [0x7f94533d683b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
 6: (()+0xb5846) [0x7f9453d23846]
 7: (()+0xb5873) [0x7f9453d23873]
 8: (()+0xb596e) [0x7f9453d2396e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
 10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
 11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
 12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
 18: (()+0x7e9a) [0x7f9454597e9a]
 19: (clone()+0x6d) [0x7f94534912ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.27.log


Current OSD tree:


# id    weight  type name       up/down reweight
-10     2       root ssdtree
-8      1               host ibstorage01-ssd1
9       1                       osd.9   up      1
-9      1               host ibstorage02-ssd1
10      1                       osd.10  up      1
-1      22.99   root default
-7      22.99           room cdsqv1
-3      22.99                   rack gopc-rack01
-2      8                               host ibstorage01-sas1
0       1                                       osd.0   down    0
1       1                                       osd.1   up      1
2       1                                       osd.2   up      1
3       1                                       osd.3   down    0
7       1                                       osd.7   up      1
4       1                                       osd.4   up      1
5       1                                       osd.5   up      1
6       1                                       osd.6   up      1
-4      6.99                            host ibstorage02-sas1
20      1                                       osd.20  down    0
21      1.03                                    osd.21  up      1
22      0.96                                    osd.22  down    0
25      1                                       osd.25  down    0
26      1                                       osd.26  up      1
27      1                                       osd.27  down    0
8       1                                       osd.8   up      1
-11     8                               host ibstorage03-sas1
11      1                                       osd.11  up      1
12      1                                       osd.12  up      1
13      1                                       osd.13  up      1
14      1                                       osd.14  up      1
15      1                                       osd.15  up      1
16      1                                       osd.16  up      1
17      1                                       osd.17  down    0
18      1                                       osd.18  up      1

The affected OSD was osd.23 on the host "ibstorage02-sas1"; it has been deleted now.


Any thoughts / things to check additionally?

Thanks !

<ceph.conf>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
