Re: Slow Request on OSD

Thanks Wido.  Reed and I have been working together to try to restore this cluster for about 3 weeks now.  I have been accumulating a number of failure modes that I am hoping to share with the Ceph group soon, but have been holding off a bit until we see the full picture clearly so that we can provide some succinct observations.

We know that losing 6 of 8 OSDs was definitely going to result in data loss, so I think we are resigned to that.  What has been difficult for us is that there have been many steps in the rebuild process that seem to get stuck and need our intervention.  But it is not 100% obvious which interventions we should be applying.

My very over-simplified hope was this:

  1. We would remove the corrupted OSDs from the cluster
  2. We would replace them with new OSDs
  3. Ceph would figure out that a lot of PGs were lost
  4. We would "agree and say okay -- lose the objects/files"
  5. The cluster would use what remains and return to working state

I feel we have done something wrong along the way, and at this point we are trying to figure out how to do step #4 completely.  We are about to follow the steps to "mark unfound lost", which makes sense to me... but I'm not sure what to do about all the other inconsistencies.

What procedure do we need to follow to just tell Ceph "those PGs are lost, let's move on"?
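
For concreteness, this is roughly the sequence I expect we will run, following the troubleshooting docs (the PG ID below is only a placeholder, and we still have to decide between revert and delete):

  ceph health detail                       # lists the PGs that have unfound objects
  ceph pg 6.1a5 query                      # placeholder PG ID; shows which OSDs it still wants to probe
  ceph pg 6.1a5 mark_unfound_lost revert   # or "delete" if no previous version of the objects exists
  ceph pg repair 6.1a5                     # for the PGs that scrub has flagged inconsistent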

===

A very quick history of what we did to get here:

  1. 8 OSDs lost power simultaneously.
  2. 2 OSDs came back without issues.
  3. 1 OSD wouldn't start (various assertion failures), but we were able to copy its PGs to a new OSD as follows (see the command sketch after this list):
    1. ceph-objectstore-tool "export"
    2. ceph osd crush rm osd.N
    3. ceph auth del osd.N
    4. ceph osd rm osd.N
    5. Create new OSD from scratch (it got a new OSD ID)
    6. ceph-objectstore-tool "import"
  4. The remaining 5 OSDs were corrupt beyond repair (could not export, mostly due to missing leveldb files after xfs_repair).  We redeployed them as follows:
    1. ceph osd crush rm osd.N
    2. ceph auth del osd.N
    3. ceph osd rm osd.N
    4. Create new OSD from scratch (it got the same OSD ID as the old OSD)
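
For reference, the export/import in steps 3.1 and 3.6 was done roughly along these lines (the OSD numbers, PG ID, and file path are placeholders here, and the OSD daemon has to be stopped while ceph-objectstore-tool runs):

  # on the failed OSD (daemon stopped): list its PGs, then export each one
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
      --journal-path /var/lib/ceph/osd/ceph-N/journal --op list-pgs
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
      --journal-path /var/lib/ceph/osd/ceph-N/journal \
      --op export --pgid 6.1a --file /backup/6.1a.export

  # on the new OSD (daemon stopped): import each exported PG, then start the OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-M \
      --journal-path /var/lib/ceph/osd/ceph-M/journal \
      --op import --file /backup/6.1a.export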

All the new OSDs from step 4.4 ended up with the same OSD IDs as the originals.  We don't know whether that is part of the problem.  It seems like doing the "crush rm" should have informed the cluster correctly, but perhaps not?

Where did we go wrong in the recovery process?

Thank you!

-- Dan

On Sep 1, 2016, at 00:18, Wido den Hollander <wido@xxxxxxxx> wrote:


On 31 August 2016 at 23:21, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:


Multiple XFS corruptions, multiple leveldb issues. They looked to be the result of write cache settings, which have since been adjusted.


That is bad news, really bad.

You'll see below that there are tons of PGs in bad states. It was slowly but surely bringing the number of bad PGs down, but it seems to have hit a brick wall with this one slow request operation.


No, you have more issues. You have 17 PGs which are incomplete, and a few which are down+incomplete.

Without those PGs functioning (active+X) your MDS will probably not work.

Take a look at: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

Make sure you get to HEALTH_WARN first; in HEALTH_ERR the MDS will never come online.
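
As a starting point, something like this will show what each bad PG is stuck on (the PG ID below is a placeholder):

  ceph health detail            # lists every down/incomplete/inconsistent PG and the unfound objects
  ceph pg dump_stuck inactive   # the PGs that never made it back to active
  ceph pg 6.1a query            # placeholder PG ID; check "recovery_state" for fields such as
                                # "blocked_by" and "down_osds_we_would_probe"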

Wido

ceph -s
cluster []
    health HEALTH_ERR
           292 pgs are stuck inactive for more than 300 seconds
           142 pgs backfill_wait
           135 pgs degraded
           63 pgs down
           80 pgs incomplete
           199 pgs inconsistent
           2 pgs recovering
           5 pgs recovery_wait
           1 pgs repair
           132 pgs stale
           160 pgs stuck inactive
           132 pgs stuck stale
           71 pgs stuck unclean
           128 pgs undersized
           1 requests are blocked > 32 sec
           recovery 5301381/46255447 objects degraded (11.461%)
           recovery 6335505/46255447 objects misplaced (13.697%)
           recovery 131/20781800 unfound (0.001%)
           14943 scrub errors
           mds cluster is degraded
    monmap e1: 3 mons at {core=[]:6789/0,db=[]:6789/0,dev=[]:6789/0}
           election epoch 262, quorum 0,1,2 core,dev,db
     fsmap e3627: 1/1/1 up {0=core=up:replay}
    osdmap e3685: 8 osds: 8 up, 8 in; 153 remapped pgs
           flags sortbitwise
     pgmap v1807138: 744 pgs, 10 pools, 7668 GB data, 20294 kobjects
           8998 GB used, 50598 GB / 59596 GB avail
           5301381/46255447 objects degraded (11.461%)
           6335505/46255447 objects misplaced (13.697%)
           131/20781800 unfound (0.001%)
                209 active+clean
                170 active+clean+inconsistent
                112 stale+active+clean
                 74 undersized+degraded+remapped+wait_backfill+peered
                 63 down+incomplete
                 48 active+undersized+degraded+remapped+wait_backfill
                 19 stale+active+clean+inconsistent
                 17 incomplete
                 12 active+remapped+wait_backfill
                  5 active+recovery_wait+degraded
                  4 undersized+degraded+remapped+inconsistent+wait_backfill+peered
                  4 active+remapped+inconsistent+wait_backfill
                  2 active+recovering+degraded
                  2 undersized+degraded+remapped+peered
                  1 stale+active+clean+scrubbing+deep+inconsistent+repair
                  1 active+clean+scrubbing+deep
                  1 active+clean+scrubbing+inconsistent


Thanks,

Reed

On Aug 31, 2016, at 4:08 PM, Wido den Hollander <wido@xxxxxxxx> wrote:


On 31 August 2016 at 22:56, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:


After a power failure left our jewel cluster crippled, I have hit a sticking point in attempted recovery.

Out of 8 OSDs, we likely lost 5-6; we are trying to salvage what we can.


That's probably too much. What do you mean by lost? Is XFS crippled/corrupted? That shouldn't happen.

In addition to RADOS pools, we were also using CephFS, and the cephfs.metadata and cephfs.data pools likely lost plenty of PGs.


What is the status of all PGs? What does 'ceph -s' show?

Are all PGs active? Since that's something which needs to be done first.

The mds has reported this ever since returning from the power loss:
# ceph mds stat
e3627: 1/1/1 up {0=core=up:replay}


When looking at the slow request on the OSD, it shows this op, which I can't quite figure out. Any help is appreciated.


Are all clients (including MDS) and OSDs running the same version?
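
Something like the following will confirm that (the MDS name "core" is taken from the fsmap above; adjust as needed):

  ceph tell osd.* version        # version reported by every running OSD
  ceph daemon mds.core version   # run on the MDS host, via its admin socket
  ceph --version                 # version of the locally installed ceph packages on each client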

Wido

# ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok dump_ops_in_flight
{
  "ops": [
      {
          "description": "osd_op(mds.0.3625:8 6.c5265ab3 (undecoded) ack+retry+read+known_if_redirected+full_force e3668)",
          "initiated_at": "2016-08-31 10:37:18.833644",
          "age": 22212.235361,
          "duration": 22212.235379,
          "type_data": [
              "no flag points reached",
              [
                  {
                      "time": "2016-08-31 10:37:18.833644",
                      "event": "initiated"
                  }
              ]
          ]
      }
  ],
  "num_ops": 1
}

Thanks,

Reed

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
