I have posted logs and strace output from our OSDs, with details, to a
ticket in the Ceph bug tracker - see
http://tracker.ceph.com/issues/21142. There you can see exactly where
the OSDs crash, which may help if someone decides to debug this.
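If anyone wants to capture a similar trace themselves, something along
these lines should work (a sketch only - the OSD id and output path are
placeholders, not the ones we used):

    # stop the managed OSD, then run it in the foreground under strace
    systemctl stop ceph-osd@12
    strace -f -o /var/log/ceph/osd.12.strace \
        /usr/bin/ceph-osd -f --cluster ceph --id 12 \
        --setuser ceph --setgroup ceph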
JZ
On 10/01/18 22:05, Josef Zelenka wrote:
Hi, today we had a disastrous crash - we are running a 3-node cluster
with 24 OSDs in total (8 per node), with SSDs for the block.db and HDDs
for the bluestore data. The cluster is used as a radosgw backend for
storing a large number of thumbnails for a file hosting site - around
110M files in total.

We were adding an interface to the nodes, which required a restart, but
after restarting one of the nodes a lot of the OSDs were kicked out of
the cluster and rgw stopped working. We have a lot of PGs down and
unfound at the moment. The OSDs can't be started (aside from some,
which is a mystery) - they fail with
FAILED assert(interval.last > last)
and then just periodically restart. So far the cluster is broken and we
can't seem to bring it back up. We tried fscking the OSDs via
ceph-objectstore-tool, but it did no good. The root of all this seems
to be the FAILED assert(interval.last > last) error, but I can't find
any info about it or how to fix it. Has anyone else encountered it?
We're running luminous on ubuntu 16.04.
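For reference, a fsck of a single OSD's store with ceph-objectstore-tool
is invoked roughly like this (the OSD id and data path here are
placeholders; the OSD must be stopped first):

    # stop the OSD, then fsck its object store offline
    systemctl stop ceph-osd@0
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fsck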
Thanks
Josef Zelenka
Cloudevelops
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com