Re: Ceph OSD crash starting up

Hello,

I was on an old version of Ceph, and it showed a warning saying:

crush map has straw_calc_version=0

I read that adjusting it would only trigger a full rebalance, so the admin should choose when to do it. So I went straight ahead and ran:


ceph osd crush tunables optimal


It rebalanced as expected, but then I started to see lots of PGs in a bad state. I discovered it was because of my OSD1. I thought it was a disk failure, so I added a new OSD6 and the system started to rebalance. In any case, the OSD was not starting.
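For what it's worth, I later learned you can inspect the tunables before touching them, and apparently set only the straw fix instead of jumping to the whole "optimal" profile (I have not verified the second command myself, so take it as a pointer, not a recipe):

# Show the current CRUSH tunables, including straw_calc_version
ceph osd crush show-tunables

# Reportedly a less disruptive alternative: change only the straw tunable
ceph osd crush set-tunable straw_calc_version 1
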

I considered wiping it completely, but I preferred to leave the disk as it was, with the journal intact, in case I can still recover data from it. (See the thread: "Scrub failing all the time, new inconsistencies keep appearing".)
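If it turns out the data is needed, my plan (untested, based on what I've read about ceph-objectstore-tool on Jewel/FileStore) would be to export PGs from the stopped OSD, for example:

# With osd.1 stopped, export one PG from its data directory
# (paths are my defaults; adjust to your layout)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --journal-path /var/lib/ceph/osd/ceph-1/journal \
    --op export --pgid 10.8c --file /root/pg-10.8c.export
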


So here's the information, but note that it already shows OSD1 replaced by OSD3, sorry.

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 1.00000  1.00000  926G  271G  654G 29.34 1.10 369
 2 1.00000  1.00000  460G  284G  176G 61.67 2.32 395
 4 1.00000  1.00000  465G  151G  313G 32.64 1.23 214
 3 1.36380  1.00000 1396G  239G 1157G 17.13 0.64 340
 6 0.90919  1.00000  931G  164G  766G 17.70 0.67 210
              TOTAL 4179G 1111G 3067G 26.60         
MIN/MAX VAR: 0.64/2.32  STDDEV: 16.99

As I said, I still have OSD1 intact, so I can do whatever you need except re-adding it to the cluster, since I don't know what that would do; it might cause havoc.
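For now I have only made sure it cannot take data again; as far as I understand these are the relevant commands (note the noin flag is cluster-wide, so I would unset it afterwards):

# Mark osd.1 out so no PGs map to it (it is down anyway)
ceph osd out 1

# Optionally stop any booting OSD from being marked in automatically
ceph osd set noin
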
Best regards,

On 14/09/17 17:12, David Turner wrote:
What do you mean by "updated crush map to 1"?  Can you please provide a copy of your crush map and `ceph osd df`?
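If you haven't extracted it before, the map can be pulled and decompiled to text like this (assuming crushtool is installed, which it normally is with the ceph packages):

# Dump the binary crush map and decompile it into readable text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
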

On Wed, Sep 13, 2017 at 6:39 AM Gonzalo Aguilar Delgado <gaguilar@xxxxxxxxxxxxxxxxxx> wrote:

Hi,

I recently updated crush map to 1 and did all the relocation of the PGs. At the end I found that one of the OSDs is not starting.

This is what it shows:


2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
 in thread 7f49cbe12700 thread_name:filestore_sync

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9616ee) [0xa93c6ef6ee]
 2: (()+0x11390) [0x7f49d9937390]
 3: (gsignal()+0x38) [0x7f49d78d3428]
 4: (abort()+0x16a) [0x7f49d78d502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0xa93c7ef43b]
 6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
 7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
 8: (()+0x76ba) [0x7f49d992d6ba]
 9: (clone()+0x6d) [0x7f49d79a53dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -3> 2017-09-13 10:37:34.253808 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282 ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0 pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.029683 0 0.000000
    -2> 2017-09-13 10:37:34.253848 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.8c( v 6220'575937 (4942'572901,6220'575937] local-les=6235 n=282 ec=419 les/c/f 6235/6235/0 6293/6293/6290) [1,2]/[2] r=-1 lpr=0 pi=6234-6292/24 crt=6220'575937 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -1> 2017-09-13 10:37:34.255018 7f49dac6e8c0  5 osd.1 pg_epoch: 6293 pg[10.90(unlocked)] enter Initial
     0> 2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
 in thread 7f49cbe12700 thread_name:filestore_sync

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9616ee) [0xa93c6ef6ee]
 2: (()+0x11390) [0x7f49d9937390]
 3: (gsignal()+0x38) [0x7f49d78d3428]
 4: (abort()+0x16a) [0x7f49d78d502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0xa93c7ef43b]
 6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
 7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
 8: (()+0x76ba) [0x7f49d992d6ba]
 9: (clone()+0x6d) [0x7f49d79a53dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.1.log
--- end dump of recent events ---



Is there any way to recover it or should I open a bug?
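If more detail is needed for a bug report, I can run it in the foreground with verbose logging, something like this (my guess at the right debug levels for a FileStore crash):

# Run osd.1 in the foreground with verbose filestore/journal logging
ceph-osd -f -i 1 --debug-filestore 20 --debug-journal 20
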


Best regards


Attachment: crushmap-9028f4da-0d77-462b-be9b-dbdf7fa57771-20171014.crushmap

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
