Re: Major ceph disaster

Presumably the two OSDs you marked as lost were hosting those
incomplete PGs? It would be useful to confirm that: check with
`ceph pg <id> query` and `ceph pg dump`.
(If so, this is why the osd_find_best_info_ignore_history_les setting
isn't helping; you don't have the minimum of 3 shards up for those 3+1
PGs.)
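
For example (the PG ids are the two incomplete ones from your health
detail output; which fields to inspect is just my suggestion):

    # per-PG peering detail: look at "up", "acting",
    # "down_osds_we_would_probe" and "peer_info" for the ids of the
    # OSDs you marked lost
    ceph pg 1.5dd query
    ceph pg 1.619 query
    # quick overview of everything that is not active+clean
    ceph pg dump pgs_brief | grep -v 'active+clean'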

If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG shards with ceph-objectstore-tool.
I've never tried this myself, but there have been threads in the past
where people exported a PG from a nearly dead HDD, imported it into
another OSD, and backfilling then completed.
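
Roughly, something like this (untested by me; the OSD ids and file
paths are placeholders, the OSDs involved must be stopped while the
tool runs, and filestore OSDs also need --journal-path):

    # on the old, "lost" OSD: export the shard of one incomplete PG
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-OLDID \
        --pgid 1.5dd --op export --file /root/pg.1.5dd.export
    # on a healthy OSD (also stopped): import the shard, then start
    # the OSD again and let backfill do the rest
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NEWID \
        --op import --file /root/pg.1.5dd.export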

If, on the other hand, those PGs are really lost forever (and someone
else should confirm what I say here), I think the next step would be
to force-recreate the incomplete PGs and then run the CephFS
scrub/repair disaster-recovery commands to recover what you can from
the filesystem.
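
Very roughly, and only once the objectstore-tool route is ruled out
(exact syntax can differ between releases, the data pool name is a
placeholder, and the CephFS part should follow the upstream disaster
recovery documentation rather than this outline):

    # recreate the two incomplete PGs as empty PGs (their data is gone)
    ceph osd force-create-pg 1.5dd
    ceph osd force-create-pg 1.619
    # back up the MDS journal before touching anything else
    cephfs-journal-tool journal export /root/mds-journal.backup
    # rebuild metadata for the lost objects from the data pool, and/or
    # run an online scrub with repair once the MDS is active again
    cephfs-data-scan scan_extents <data pool>
    cephfs-data-scan scan_inodes <data pool>
    # run on the MDS host (ceph-node02):
    ceph daemon mds.ceph-node02.etp.kit.edu scrub_path / recursive repair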

-- dan


On Mon, May 13, 2019 at 4:20 PM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
>
> Dear ceph experts,
>
> we have several (maybe related) problems with our ceph cluster, let me
> first show you the current ceph status:
>
>    cluster:
>      id:     23e72372-0d44-4cad-b24f-3641b14b86f4
>      health: HEALTH_ERR
>              1 MDSs report slow metadata IOs
>              1 MDSs report slow requests
>              1 MDSs behind on trimming
>              1/126319678 objects unfound (0.000%)
>              19 scrub errors
>              Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>              Possible data damage: 7 pgs inconsistent
>              Degraded data redundancy: 1/500333881 objects degraded
> (0.000%), 1 pg degraded
>              118 stuck requests are blocked > 4096 sec. Implicated osds
> 24,32,91
>
>    services:
>      mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>      mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>      mds: cephfs-1/1/1 up  {0=ceph-node02.etp.kit.edu=up:active}, 3
> up:standby
>      osd: 96 osds: 96 up, 96 in
>
>    data:
>      pools:   2 pools, 4096 pgs
>      objects: 126.32M objects, 260TiB
>      usage:   372TiB used, 152TiB / 524TiB avail
>      pgs:     0.049% pgs not active
>               1/500333881 objects degraded (0.000%)
>               1/126319678 objects unfound (0.000%)
>               4076 active+clean
>               10   active+clean+scrubbing+deep
>               7    active+clean+inconsistent
>               2    incomplete
>               1    active+recovery_wait+degraded
>
>    io:
>      client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr
>
>
> and ceph health detail:
>
>
> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
> 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
> scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
> incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
> redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
> stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>      mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
> blocked > 30 secs, oldest blocked for 351193 secs
> MDS_SLOW_REQUEST 1 MDSs report slow requests
>      mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
> MDS_TRIM 1 MDSs behind on trimming
>      mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128)
> max_segments: 128, num_segments: 46034
> OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
>      pg 1.24c has 1 unfound objects
> OSD_SCRUB_ERRORS 19 scrub errors
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>      pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31
> min_size from 3 may help; search ceph.com/docs for 'incomplete')
>      pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31
> min_size from 3 may help; search ceph.com/docs for 'incomplete')
> PG_DAMAGED Possible data damage: 7 pgs inconsistent
>      pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
>      pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
>      pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
>      pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
>      pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
>      pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
>      pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
> PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded
> (0.000%), 1 pg degraded
>      pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1
> unfound
> REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds
> 24,32,91
>      118 ops are blocked > 536871 sec
>      osds 24,32,91 have stuck requests > 536871 sec
>
>
> Let me briefly summarize the setup: we have 4 nodes with 24 OSDs each
> and use 3+1 erasure coding. The nodes run CentOS 7 and, due to a major
> mistake when setting up the cluster, we run more than one Ceph version
> across the nodes: three nodes are on 12.2.12 and one is on 13.2.5. We
> currently don't dare to update all nodes to 13.2.5. For all the
> version details see:
>
> {
>      "mon": {
>          "ceph version 12.2.12
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 3
>      },
>      "mgr": {
>          "ceph version 12.2.12
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
>      },
>      "osd": {
>          "ceph version 12.2.12
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 72,
>          "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988)
> mimic (stable)": 24
>      },
>      "mds": {
>          "ceph version 12.2.12
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 4
>      },
>      "overall": {
>          "ceph version 12.2.12
> (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 81,
>          "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988)
> mimic (stable)": 24
>      }
> }
>
> Here is what happened: one OSD daemon could not be started, so we
> decided to mark the OSD as lost and set it up from scratch. Ceph
> started recovering, and then we lost another OSD with the same
> behavior. We did the same as for the first OSD, and now we are stuck
> with 2 PGs in the incomplete state. `ceph pg query` gives the
> following problem:
>
>              "down_osds_we_would_probe": [],
>              "peering_blocked_by": [],
>              "peering_blocked_by_detail": [
>                  {
>                      "detail": "peering_blocked_by_history_les_bound"
>                  }
>
> We already tried setting "osd_find_best_info_ignore_history_les":
> "true" for the affected OSDs, which had no effect. Furthermore, the
> cluster is behind on trimming by more than 40,000 segments, and we
> have folders and files which cannot be deleted or moved (and which
> are not on the 2 incomplete PGs). Is there any way to solve these
> problems?
>
> Best regards,
>
> Kevin
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



