Hmmm, after crashing every 30 seconds for a few days, it's apparently running normally again. Weird. Since it's looking for a snapshot object, I was thinking that re-enabling snap trimming and removing all the snapshots in the pool might remove that object (and the problem with it). I never got to that point this time, but I'm going to need to cycle more OSDs in and out of the cluster, so if it happens again I might try that and update the list.
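If I do end up trying it, the plan would be roughly the following. This is just a sketch: the pool/image names are placeholders, and 2 is (I believe) the default for the snap trim setting I had zeroed out earlier.

    # re-enable snapshot trimming on all OSDs
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 2'

    # remove every snapshot of each image in the pool, e.g.
    rbd -p rbd snap purge imagename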
Thanks!
-Steve
On 05/17/2017 03:17 PM, Gregory Farnum wrote:
Hello,

The other day I started a backup (create a snap, then export and import it into a second cluster; one RBD image is still exporting/importing as of this message) while recovery operations on the primary cluster were ongoing, and I noticed an OSD (osd.126) start to crash. I reweighted it to 0 to prepare to remove it. Shortly thereafter the problem seemed to move to another OSD (osd.223). After looking at the logs, I noticed they appeared to have the same problem. I'm running Ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.
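(For context, the backup is essentially just a snapshot followed by an export/import pipe into the second cluster; roughly along these lines, with the image, snapshot, and host names here made up:)

    # on the primary cluster: snapshot the image
    rbd snap create rbd/vm01@backup-20170515

    # stream the snapshot into the backup cluster
    rbd export rbd/vm01@backup-20170515 - | ssh backuphost rbd import - rbd/vm01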
Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe
Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA
May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}
May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 10:39:55.322306
May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())

May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839
May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())
I did some searching and thought it might be related to http://tracker.ceph.com/issues/13837 (aka https://bugzilla.redhat.com/show_bug.cgi?id=1351320), so I disabled scrubbing and deep-scrubbing and set osd_pg_max_concurrent_snap_trims to 0 for all OSDs. No luck. I had changed the systemd service file to automatically restart osd.223 while recovery was happening, but recovery appears to have stalled; I suppose that OSD needs to stay up for the remaining objects.
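(To be concrete, this is roughly what I ran to turn those off; exact syntax from memory:)

    # stop scrubbing and deep-scrubbing cluster-wide
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # stop snapshot trimming on all OSDs
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'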
Yeah, these aren't really related as far as I can see, though I haven't spent much time in this code that I can recall. The OSD is receiving a "push" as part of log recovery and finds that the object it's receiving is a snapshot object without having any information about the snap IDs that exist, which is weird. I don't know of any way a client could break it either, but maybe David or Jason know something more.
-Greg
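(One way to check what snapshot metadata a suspect clone object actually carries, if that helps with debugging; the pool and object name below are placeholders:)

    # list the snaps/clones recorded for a given RADOS object
    rados -p rbd listsnaps rbd_data.102fb12ae8944a.0000000000000000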
I didn't see anything else online, so I thought I'd see if anyone has seen this before or has any other ideas. Thanks for taking the time.
-Steve
--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma310@xxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com