On 5/21/19 4:48 PM, Kevin Flöh wrote:
> Hi,
>
> we gave up on the incomplete pgs since we do not have enough complete shards to restore them. What is the procedure to get rid of these pgs?
>

You need to start with marking the OSDs as 'lost' and then you can force_create_pg to get the PGs back (empty).

Wido

> regards,
>
> Kevin
>
> On 20.05.19 9:22 a.m., Kevin Flöh wrote:
>> Hi Frédéric,
>>
>> we do not have access to the original OSDs. We exported the remaining shards of the two pgs but we are only left with two shards (of reasonable size) per pg. The rest of the shards displayed by ceph pg query are empty. I guess marking the OSD as complete doesn't make sense then.
>>
>> Best,
>> Kevin
>>
>> On 17.05.19 2:36 p.m., Frédéric Nass wrote:
>>>
>>> On 14/05/2019 at 10:04, Kevin Flöh wrote:
>>>>
>>>> On 13.05.19 11:21 p.m., Dan van der Ster wrote:
>>>>> Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
>>>>> It would be useful to double confirm that: check with `ceph pg <id> query` and `ceph pg dump`.
>>>>> (If so, this is why the ignore history les thing isn't helping; you don't have the minimum 3 stripes up for those 3+1 PGs.)
>>>>
>>>> yes, but as written in my other mail, we still have enough shards, at least I think so.
>>>>
>>>>> If those "lost" OSDs by some miracle still have the PG data, you might be able to export the relevant PG stripes with the ceph-objectstore-tool. I've never tried this myself, but there have been threads in the past where people export a PG from a nearly dead hdd, import to another OSD, and then backfilling works.
>>>>
>>>> guess that is not possible.
>>>
>>> Hi Kevin,
>>>
>>> You want to make sure of this.
>>>
>>> Unless you recreated the OSDs 4 and 23 and had new data written on them, they should still host the data you need.
>>> What Dan suggested (export the 7 inconsistent PGs and import them on a healthy OSD) seems to be the only way to recover your lost data, as with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity when you actually need 3 to access it. Reducing min_size to 3 will not help.
>>>
>>> Have a look here:
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html
>>>
>>> This is probably the best way to follow from now on.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>>>> If OTOH those PGs are really lost forever, and someone else should confirm what I say here, I think the next step would be to force recreate the incomplete PGs and then run a set of cephfs scrub/repair disaster recovery cmds to recover what you can from the cephfs.
>>>>>
>>>>> -- dan
>>>>
>>>> would this let us recover at least some of the data on the pgs? If not, we would just set up a new ceph directly without fixing the old one and copy whatever is left.
>>>>
>>>> Best regards,
>>>>
>>>> Kevin
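For reference, here is a rough, untested sketch of the export/import path Dan and Frédéric describe above, in case the original disks ever become readable again. It assumes the lost OSDs were 4 and 23 and the incomplete PGs are 1.5dd and 1.619 (both taken from the status output quoted below), default /var/lib/ceph/osd data paths, and a hypothetical healthy target osd.99; adjust everything to the actual environment and keep copies of the export files:

    # Double-check which OSDs the incomplete PGs map to, as Dan suggests:
    ceph pg 1.5dd query | less
    ceph pg dump pgs_brief | grep -E '^1\.(5dd|619)'

    # With the source OSD stopped, list which PG shards its store still holds
    # (EC shards show up as <pgid>s<shard>, e.g. 1.5dds1; add --journal-path
    # for FileStore OSDs):
    systemctl stop ceph-osd@4
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --op list-pgs

    # Export the needed shard to a file:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --pgid 1.5dds1 --op export --file /root/1.5dds1.export

    # Import it into a stopped, healthy OSD, then start that OSD and let
    # backfill take over:
    systemctl stop ceph-osd@99
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
        --op import --file /root/1.5dds1.export
    systemctl start ceph-osd@99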
>>>>>
>>>>> On Mon, May 13, 2019 at 4:20 PM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
>>>>>> Dear ceph experts,
>>>>>>
>>>>>> we have several (maybe related) problems with our ceph cluster, let me first show you the current ceph status:
>>>>>>
>>>>>>   cluster:
>>>>>>     id:     23e72372-0d44-4cad-b24f-3641b14b86f4
>>>>>>     health: HEALTH_ERR
>>>>>>             1 MDSs report slow metadata IOs
>>>>>>             1 MDSs report slow requests
>>>>>>             1 MDSs behind on trimming
>>>>>>             1/126319678 objects unfound (0.000%)
>>>>>>             19 scrub errors
>>>>>>             Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>>>>>>             Possible data damage: 7 pgs inconsistent
>>>>>>             Degraded data redundancy: 1/500333881 objects degraded (0.000%), 1 pg degraded
>>>>>>             118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
>>>>>>
>>>>>>   services:
>>>>>>     mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>>>>>>     mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>>>>>>     mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3 up:standby
>>>>>>     osd: 96 osds: 96 up, 96 in
>>>>>>
>>>>>>   data:
>>>>>>     pools:   2 pools, 4096 pgs
>>>>>>     objects: 126.32M objects, 260TiB
>>>>>>     usage:   372TiB used, 152TiB / 524TiB avail
>>>>>>     pgs:     0.049% pgs not active
>>>>>>              1/500333881 objects degraded (0.000%)
>>>>>>              1/126319678 objects unfound (0.000%)
>>>>>>              4076 active+clean
>>>>>>              10   active+clean+scrubbing+deep
>>>>>>              7    active+clean+inconsistent
>>>>>>              2    incomplete
>>>>>>              1    active+recovery_wait+degraded
>>>>>>
>>>>>>   io:
>>>>>>     client: 449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr
>>>>>>
>>>>>>
>>>>>> and ceph health detail:
>>>>>>
>>>>>> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19 scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs incomplete; Possible data damage: 7 pgs inconsistent; Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118 stuck requests are blocked > 4096 sec.
>>>>>> Implicated osds 24,32,91
>>>>>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>>>>>     mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 351193 secs
>>>>>> MDS_SLOW_REQUEST 1 MDSs report slow requests
>>>>>>     mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
>>>>>> MDS_TRIM 1 MDSs behind on trimming
>>>>>>     mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128) max_segments: 128, num_segments: 46034
>>>>>> OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
>>>>>>     pg 1.24c has 1 unfound objects
>>>>>> OSD_SCRUB_ERRORS 19 scrub errors
>>>>>> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>>>>>>     pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
>>>>>>     pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31 min_size from 3 may help; search ceph.com/docs for 'incomplete')
>>>>>> PG_DAMAGED Possible data damage: 7 pgs inconsistent
>>>>>>     pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
>>>>>>     pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
>>>>>>     pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
>>>>>>     pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
>>>>>>     pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
>>>>>>     pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
>>>>>>     pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
>>>>>> PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded
>>>>>>     pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 unfound
>>>>>> REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
>>>>>>     118 ops are blocked > 536871 sec
>>>>>>     osds 24,32,91 have stuck requests > 536871 sec
>>>>>>
>>>>>>
>>>>>> Let me briefly summarize the setup: We have 4 nodes with 24 osds each and use 3+1 erasure coding. The nodes run on centos7 and we use, due to a major mistake when setting up the cluster, more than one ceph version on the nodes: 3 nodes run on 12.2.12 and one runs on 13.2.5. We are currently not daring to update all nodes to 13.2.5. For all the version details see:
>>>>>>
>>>>>> {
>>>>>>     "mon": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 3
>>>>>>     },
>>>>>>     "mgr": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
>>>>>>     },
>>>>>>     "osd": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 72,
>>>>>>         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
>>>>>>     },
>>>>>>     "mds": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 4
>>>>>>     },
>>>>>>     "overall": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 81,
>>>>>>         "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 24
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch. Ceph started recovering and then we lost another osd with the same behavior.
>>>>>> We did the same as for the first osd. And now we are stuck with 2 pgs in incomplete. Ceph pg query gives the following problem:
>>>>>>
>>>>>>     "down_osds_we_would_probe": [],
>>>>>>     "peering_blocked_by": [],
>>>>>>     "peering_blocked_by_detail": [
>>>>>>         {
>>>>>>             "detail": "peering_blocked_by_history_les_bound"
>>>>>>         }
>>>>>>
>>>>>> We already tried to set "osd_find_best_info_ignore_history_les": "true" for the affected osds, which had no effect. Furthermore, the cluster is behind on trimming by more than 40,000 segments and we have folders and files which cannot be deleted or moved (these are not on the 2 incomplete pgs). Is there any way to solve these problems?
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Kevin
>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
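For reference, a minimal sketch of the procedure Wido describes at the top of the thread for giving up on the two incomplete PGs and recreating them empty. It assumes OSDs 4 and 23 are the ones being written off and 1.5dd and 1.619 are the incomplete PGs (both from the health output above); any data still referenced by those PGs is gone for good once this is done, and the exact command names differ between releases:

    # Mark the unrecoverable OSDs as lost:
    ceph osd lost 4 --yes-i-really-mean-it
    ceph osd lost 23 --yes-i-really-mean-it

    # Recreate the incomplete PGs as empty PGs (Luminous and later; older
    # releases used `ceph pg force_create_pg <pgid>` instead):
    ceph osd force-create-pg 1.5dd
    ceph osd force-create-pg 1.619

    # Then, as Dan suggested, run the cephfs scrub/repair tooling to find out
    # what the filesystem lost, e.g. an online scrub started on the node
    # running the active MDS:
    ceph daemon mds.ceph-node02.etp.kit.edu scrub_path / recursive repair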