OSD fails to start after power failure

David Young <davidy@xxxxxxxxxxxxxxxxxx> · Sun, 15 Jul 2018 09:55:01 +1200



Hey folks,

I have a Luminous 12.2.6 cluster which suffered a power failure
recently. On recovery, one of my OSDs is continually crashing and
restarting, with the error below:

                  
----
9ae00 con 0
    -3> 2018-07-15 09:50:58.313242 7f131c5a9700 10 monclient: tick
    -2> 2018-07-15 09:50:58.313277 7f131c5a9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2018-07-15 09:50:28.313274)
    -1> 2018-07-15 09:50:58.313320 7f131c5a9700 10 log_client  log_queue is 8 last_log 10 sent 0 num 8 unsent 10 sending 10
     0> 2018-07-15 09:50:58.320255 7f131c5a9700 -1 /build/ceph-12.2.6/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f131c5a9700 time 2018-07-15 09:50:58.313336
/build/ceph-12.2.6/src/common/LogClient.cc: 294: FAILED assert(num_unsent <= log_queue.size())
----

                
I've found a few recent references to this "FAILED assert" message
(assuming that's the cause of the problem), such as
https://bugzilla.redhat.com/show_bug.cgi?id=1599718 and
http://tracker.ceph.com/issues/18209, with the most recent occurrence
being 3 days ago (http://tracker.ceph.com/issues/18209#note-12).
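
For anyone comparing their own crash against those reports, here is a
minimal sketch of pulling the failed assertion (source file, line number,
and condition) out of an OSD log dump. The function name
find_failed_asserts is hypothetical, and the pattern is inferred only from
the log lines quoted above, not from any documented Ceph log format, so
treat it as an assumption that may need adjusting for other versions.

```python
import re

# Start of an assertion failure as it appears in the excerpt above, e.g.
#   /build/.../LogClient.cc: 294: FAILED assert(
# (pattern assumed from that one excerpt, not from a Ceph specification)
HEAD_RE = re.compile(r"(\S+):\s*(\d+):\s*FAILED\s+assert\(")


def find_failed_asserts(log_text):
    """Return a list of (source_file, line_number, condition) tuples."""
    # Collapse all whitespace so assertions wrapped across several
    # lines (as mail clients tend to do) are still matched.
    flat = " ".join(log_text.split())
    results = []
    for m in HEAD_RE.finditer(flat):
        # Walk forward to the parenthesis that closes assert( ... ),
        # tracking nesting because the condition may contain calls.
        depth = 1
        i = m.end()
        while i < len(flat) and depth > 0:
            if flat[i] == "(":
                depth += 1
            elif flat[i] == ")":
                depth -= 1
            i += 1
        if depth == 0:
            condition = flat[m.end():i - 1]
            results.append((m.group(1), int(m.group(2)), condition))
    return results
```

Running it over the saved log text gives one tuple per failed assertion,
which makes it easier to check whether the file, line, and condition match
the ones in the tracker issue.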

    
Is there any resolution to this issue, or anything I can try in order
to recover the OSD?

Thanks!

D

    
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com