Re: OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

Steve Anthony <sma310@xxxxxxxxxx> · Fri, 2 Jun 2017 14:15:40 -0400



    I'm seeing this again on two OSDs after adding another 20 disks
      to my cluster. Is there some way I can determine which
      snapshots the recovery process is looking for? Or maybe find and
      remove the objects it's trying to recover, since there's
      apparently a problem with them? Thanks!
    -Steve

    
    On 05/18/2017 01:06 PM, Steve Anthony wrote:

    
      Hmmm, after crashing every 30 seconds for a few days, it's
        apparently running normally again. Weird. I was thinking that
        since it's looking for a snapshot object, maybe re-enabling
        snaptrimming and removing all the snapshots in the pool would
        remove that object (and the problem)? I never got to that point
        this time, but I'm going to need to cycle more OSDs in and out
        of the cluster, so if it happens again I might try that and
        post an update.

      
      Thanks!
      -Steve

      
      On 05/17/2017 03:17 PM, Gregory Farnum wrote:

      
            On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma310@xxxxxxxxxx> wrote:

            
            Hello,

              
              After starting a backup (create snap, export and import into a
              second cluster - one RBD image still exporting/importing as of
              this message) the other day while recovery operations on the
              primary cluster were ongoing I noticed an OSD (osd.126) start
              to crash; I reweighted it to 0 to prepare to remove it. Shortly
              thereafter I noticed the problem seemed to move to another OSD
              (osd.223). After looking at the logs, I noticed they appeared
              to have the same problem. I'm running Ceph version 9.2.1
              (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8.

              
              Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe

              
              Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA

              
              May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}
              May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 10:39:55.322306
              May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())

              May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}
              May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839
              May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size())
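
For anyone comparing several OSD logs for the same failure, here is a rough Python sketch that tallies this assert per OSD from journal-style lines like the ones quoted above. It assumes the syslog prefix format shown in the excerpts ("<host> ceph-osd[<pid>]:") and that each OSD logs its id in a "log_to_monitors" line before it crashes. The function name count_assert_crashes and the sample list are made up for illustration and are not part of Ceph.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the excerpts: "<host> ceph-osd[<pid>]:"
PREFIX = re.compile(r"(?P<host>\S+) ceph-osd\[(?P<pid>\d+)\]:")
# Shortly after start an OSD logs e.g. "osd.126 616621 log_to_monitors"
OSD_ID = re.compile(r"\b(osd\.\d+) \d+ log_to_monitors")
ASSERT = "FAILED assert(recovery_info.oi.snaps.size())"


def count_assert_crashes(lines):
    """Count hits of the snaps assert, keyed by OSD id where it is known."""
    pid_to_osd = {}      # (host, pid) -> "osd.N", learned from log_to_monitors
    crashes = Counter()
    for line in lines:
        m = PREFIX.search(line)
        if not m:
            continue
        key = (m.group("host"), m.group("pid"))
        osd = OSD_ID.search(line)
        if osd:
            pid_to_osd[key] = osd.group(1)
        if ASSERT in line:
            # Fall back to host/pid if the OSD id was never seen for this pid
            crashes[pid_to_osd.get(key, "%s pid %s" % key)] += 1
    return crashes


# Sample lines taken from the excerpts above (function-name line omitted)
sample = [
    "May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342 "
    "7f225c385900 -1 osd.126 616621 log_to_monitors {default=true}",
    "May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: "
    "FAILED assert(recovery_info.oi.snaps.size())",
    "May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 "
    "7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true}",
    "May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: "
    "FAILED assert(recovery_info.oi.snaps.size())",
]

if __name__ == "__main__":
    for name, n in sorted(count_assert_crashes(sample).items()):
        print(name, n)
```

The mapping from pid to OSD id only works when the startup line and the assert come from the same process, which is the case in both excerpts here.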

              
              I did some searching and thought it might be related to
              http://tracker.ceph.com/issues/13837 aka
              https://bugzilla.redhat.com/show_bug.cgi?id=1351320, so I
              disabled scrubbing and deep-scrubbing, and set
              osd_pg_max_concurrent_snap_trims to 0 for all OSDs. No luck. I
              had changed the systemd service file to automatically restart
              osd.223 while recovery was happening, but recovery appears to
              have stalled; I suppose osd.223 needs to be up for the
              remaining objects.
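
For reference, the workaround described above can be written down as a small Python sketch that only builds and prints the corresponding ceph CLI invocations. The message does not say exactly how the settings were applied, so the use of "ceph osd set" for the scrub flags and "ceph tell osd.* injectargs" for the snap trim option is an assumption, and the helper name build_commands is hypothetical. Nothing is executed against a cluster.

```python
import shlex


def build_commands(max_snap_trims=0):
    """Return the ceph CLI argument lists for the workaround (dry run only)."""
    return [
        # Cluster-wide flags that stop new scrubs and deep scrubs
        ["ceph", "osd", "set", "noscrub"],
        ["ceph", "osd", "set", "nodeep-scrub"],
        # Runtime change of the snap trim concurrency on every OSD
        ["ceph", "tell", "osd.*", "injectargs",
         "--osd_pg_max_concurrent_snap_trims %d" % max_snap_trims],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for review instead of running them
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the quoting from shlex.join matters because "osd.*" would otherwise be expanded by the shell.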

            
            Yeah, these aren't really related that I can see —
              though I haven't spent much time in this code that I can
              recall. The OSD is receiving a "push" as part of log
              recovery and finds that the object it's receiving is a
              snapshot object without having any information about the
              snap IDs that exist, which is weird. I don't know of any
              way a client could break it either, but maybe David or
              Jason know something more.
            -Greg
             
             
              I didn't see anything else online, so I thought I'd see if
              anyone has seen this before or has any other ideas. Thanks
              for taking the time.

              
              -Steve

              
              --

              Steve Anthony

              LTS HPC Senior Analyst

              Lehigh University

              sma310@xxxxxxxxxx

              
              _______________________________________________

              ceph-users mailing list

              ceph-users@xxxxxxxxxxxxxx

              http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

            

    
  
