Hi Jan,
Yes, that's exactly the case - the filesystem on the OSDs was corrupted, but the controller is not Intel SATA/SAS.
Hardware: 3x Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05), driven by mpt2sas.
In our case, however, the kernel crash came almost at once after the HDD was connected, not after a delay.
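Commands along these lines can be used to confirm the controller model and the mpt2sas driver version on a given host (output will of course differ per box):

    # list SAS/SATA controllers on the PCI bus
    lspci | grep -iE "sas|sata"
    # version of the mpt2sas module available to the kernel
    modinfo mpt2sas | grep -i version
    # any messages the controller driver logged at boot or on hotplug
    dmesg | grep -i mpt2sas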
Max
On 27/02/2016 11:54 PM, Jan Schermer wrote:
Anything in dmesg/kern.log at the time this happened?

 0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
I think your filesystem was somehow corrupted.
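Something like this should show whether the kernel logged anything around that timestamp (the path assumes a Debian/Ubuntu-style kern.log; adjust for your distro):

    # pull the minute around the crash from the kernel log
    grep "Feb 26 23:20" /var/log/kern.log
    # or, if the host has not rebooted since, check the ring buffer directly
    dmesg | grep -iE "error|fail|sd[a-z]"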
And regarding this: "2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller."
What are the drives connected to? We have had problems with the Intel SATA/SAS driver: you can hotplug a drive, but if you remove one and put in another, the kernel crashes. It only happens if some time passes between those two actions, which makes it very nasty.
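A removal/insert sequence along these lines may avoid that race, since the kernel is told about the swap instead of discovering it (sdX and host0 are placeholders for the actual device and SCSI host):

    # take the old disk offline and delete it from the SCSI layer before pulling it
    echo offline > /sys/block/sdX/device/state
    echo 1 > /sys/block/sdX/device/delete
    # after inserting the replacement, rescan the controller's SCSI host
    echo "- - -" > /sys/class/scsi_host/host0/scan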
Jan
Hi Cephers
At the moment we are trying to recover our CEPH cluster (0.87), which is behaving very oddly.
What has been done:
1. OSD drive failure happened - CEPH put the OSD down and out.
2. Physical HDD replaced and NOT added to CEPH - here we had a strange kernel crash just after the HDD was connected to the controller.
3. Physical host rebooted.
4. CEPH started restoration and is putting OSDs down one by one (I can actually see the osd process crash in the logs; a way to pause this while investigating is sketched below).
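While the crash is being investigated, flags roughly like these can stop the cluster from digging itself deeper (the backtrace below points at the scrub path, so the scrub flags in particular may keep OSDs from aborting; adjust to your situation):

    ceph osd set noout          # don't mark down OSDs out and rebalance away from them
    ceph osd set noscrub        # pause regular scrubbing
    ceph osd set nodeep-scrub   # pause deep scrubbing
    ceph -s                     # then watch the cluster state
    ceph health detail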
ceph.conf is attached.
OSD failure:

 -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
 -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
  0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
 in thread 7f9434e0f700

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: /usr/bin/ceph-osd() [0x9e2015]
 2: (()+0xfcb0) [0x7f945459fcb0]
 3: (gsignal()+0x35) [0x7f94533d30d5]
 4: (abort()+0x17b) [0x7f94533d683b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
 6: (()+0xb5846) [0x7f9453d23846]
 7: (()+0xb5873) [0x7f9453d23873]
 8: (()+0xb596e) [0x7f9453d2396e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
 10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
 11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
 12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
 18: (()+0x7e9a) [0x7f9454597e9a]
 19: (clone()+0x6d) [0x7f94534912ed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.27.log
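Regarding the NOTE in the dump above: something like this can resolve the backtrace offsets, assuming the matching debug symbols for 0.87 are installed (otherwise only raw addresses come back):

    # disassemble the exact binary that crashed, with source interleaved
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm
    # or resolve a single frame from the backtrace, e.g. frame 1 at 0x9e2015
    addr2line -Cfe /usr/bin/ceph-osd 0x9e2015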
Current OSD tree:
# id   weight  type name                    up/down  reweight
-10    2       root ssdtree
-8     1         host ibstorage01-ssd1
9      1           osd.9                    up       1
-9     1         host ibstorage02-ssd1
10     1           osd.10                   up       1
-1     22.99   root default
-7     22.99     room cdsqv1
-3     22.99       rack gopc-rack01
-2     8             host ibstorage01-sas1
0      1               osd.0                down     0
1      1               osd.1                up       1
2      1               osd.2                up       1
3      1               osd.3                down     0
7      1               osd.7                up       1
4      1               osd.4                up       1
5      1               osd.5                up       1
6      1               osd.6                up       1
-4     6.99          host ibstorage02-sas1
20     1               osd.20               down     0
21     1.03            osd.21               up       1
22     0.96            osd.22               down     0
25     1               osd.25               down     0
26     1               osd.26               up       1
27     1               osd.27               down     0
8      1               osd.8                up       1
-11    8             host ibstorage03-sas1
11     1               osd.11               up       1
12     1               osd.12               up       1
13     1               osd.13               up       1
14     1               osd.14               up       1
15     1               osd.15               up       1
16     1               osd.16               up       1
17     1               osd.17               down     0
18     1               osd.18               up       1
The affected OSD was osd.23 on host "ibstorage02-sas1" - it has been deleted now.
Any thoughts / things to check additionally?
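If any of the other down OSDs end up needing the same treatment as osd.23, the usual removal sequence is roughly this (a sketch; substitute the right ID and stop the ceph-osd daemon on the host first):

    ceph osd out 23                 # make sure it is marked out
    ceph osd crush remove osd.23    # drop it from the CRUSH map
    ceph auth del osd.23            # remove its cephx key
    ceph osd rm 23                  # remove it from the OSD map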
Thanks!
<ceph.conf>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com