Re: recovery of a downed/inaccessible pg

Hi,

Did you make any progress on this, or is the cluster still down? Is your failure domain "osd"? With EC k=8, m=4 and only one host having an issue, I wouldn't expect that outcome if your failure domain were "host". I would recommend checking (and fixing) that after you have recovered the cluster. I'm not sure why you marked those OSDs as lost; you only had down OSDs? Did you find out why they failed to start? It's unclear to me how bad it is now that you marked the OSDs as lost. I assume that in this case, trying to export/import those PGs from the down OSDs is the only chance left. But since you marked them lost, I'm not sure how promising that is.
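For example, something along these lines should show whether the failure domain is "host" or "osd" (the pool name is taken from your health output; the profile and rule names are placeholders):

```
# Which EC profile and CRUSH rule does the affected pool use?
ceph osd pool get i1-sea-rbd-01-data erasure_code_profile
ceph osd pool get i1-sea-rbd-01-data crush_rule

# The profile lists crush-failure-domain (should be "host" rather than "osd")
ceph osd erasure-code-profile get <profile-name>

# The rule shows the chooseleaf/choose step and the bucket type it selects
ceph osd crush rule dump <rule-name>
```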

After the first OSDs were marked "out", did the recovery progress? Did you let it finish before you marked the first OSD as "lost"?

Regards,
Eugen

Quoting Nick Anderson <ande3707@xxxxxxxxx>:

Hello ceph user-community,

We have a Reef (18.2.2) cluster with an erasure-coded pool (k=8, m=4) that
has a pg that is currently down with the following:

```
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 2.699 is remapped+incomplete, acting [NONE,NONE,NONE,NONE,NONE,353,488,333,408,282,167,145] (reducing pool i1-sea-rbd-01-data min_size from 9 may help; search ceph.com/docs for 'incomplete')
```
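For reference, the pool setting that warning refers to can be inspected, and cautiously lowered, with commands along these lines; this is only a sketch using the pool name from the warning, where min_size 8 corresponds to k:

```
# Current EC pool parameters (for k=8, m=4: size=12, default min_size=9)
ceph osd pool get i1-sea-rbd-01-data size
ceph osd pool get i1-sea-rbd-01-data min_size

# What the warning hints at: temporarily allow I/O with only k shards available
ceph osd pool set i1-sea-rbd-01-data min_size 8
```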

Here is the timeline leading up to the current state noted above:

We had an issue with a host that had a flapping network, and during our
efforts to restore the network, prolonged "OSD heartbeats not reachable"
conditions seem to have caused two OSDs to go down/out.  The OSD containers
were stuck in a restart loop and were eventually auto-out'd from the cluster.
Restarting the affected OSD containers had no effect and the pg was stuck
in a "down+remapped" state.

```
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg down
    pg 2.699 is down+remapped, acting [NONE,NONE,NONE,NONE,613,353,488,333,408,282,167,145]
```
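(As an aside, here is a sketch of the kind of commands that can be used to inspect a looping cephadm OSD container and to stop the automatic out-marking while troubleshooting; osd.38 is used as the example id:)

```
# Systemd unit state for the containerized OSD (cephadm units: ceph-<fsid>@osd.<id>)
systemctl status 'ceph-*@osd.38.service'

# Recent daemon logs via cephadm / journald
cephadm logs --name osd.38
journalctl -u 'ceph-*@osd.38.service' --since "1 hour ago"

# Prevent further OSDs from being auto-marked "out" while troubleshooting
ceph osd set noout
```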

Looking over the `ceph pg 2.699 query` output, we saw the following:

```
 "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                38,
                120
            ],
            "peering_blocked_by": [
                {
                    "osd": 38,
                    "current_lost_at": 90605,
                    "comment": "starting or marking this osd lost may let
us proceed"
                },
                {
                    "osd": 120,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let
us proceed"
                },
```
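(For reference, the relevant part of the query output can be pulled out with something like the following; the jq filter is illustrative only:)

```
# Dump just the recovery_state section, which contains "blocked",
# "down_osds_we_would_probe" and "peering_blocked_by"
ceph pg 2.699 query | jq '.recovery_state'

# Or, without jq:
ceph pg 2.699 query | grep -A 5 down_osds_we_would_probe
```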

With that output and reading the Ceph documentation[1], we decided to mark
osd.38 as lost (`ceph osd lost 38`).  Within a couple of minutes, a third
OSD, osd.613, went down and exhibited the same restart-loop behavior.

That left us with the following state from `ceph pg 2.699 query`:

```
            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                38,
                120,
                613
            ],
            "peering_blocked_by": [
                {
                    "osd": 38,
                    "current_lost_at": 90605,
                    "comment": "starting or marking this osd lost may let
us proceed"
                },
                {
                    "osd": 120,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let
us proceed"
                },
                {
                    "osd": 613,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let
us proceed"
                }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2024-12-21T17:13:18.435769+0000"
        }
    ],
    "agent_state": {}
```

After sleeping on it and letting the cluster recover from another failed
OSD, we decided to mark all three as lost (osd.38, osd.120, and osd.613).
Unfortunately, this caused two additional OSDs (osd.226 and osd.339) to go
down.  This changed the pg state from `down+remapped`
to `remapped+incomplete`.

We now have 5 down/out OSDs and are wondering if there is any way to recover
this PG.  This one PG has caused our 1.6 PiB of data to be inaccessible.

We've found some documentation[2] describing how to export the affected pg
data with ceph-objectstore-tool and import it into a new OSD.  Is this our
only route to possibly recovering this pg and its data?
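For concreteness, the flow described in [2] looks roughly like the following on a containerized deployment; the shard suffix (s0), the file paths and the target OSD id are placeholders that would need to be adapted, so treat this as a sketch rather than something we have run:

```
# Stop the source OSD and enter its container context (cephadm)
ceph orch daemon stop osd.38
cephadm shell --name osd.38

# Inside the shell: find the PG shard present on this OSD, then export it.
# For an EC pool the pgid carries a shard suffix, e.g. 2.699s0.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-38 --op list-pgs | grep 2.699
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-38 \
    --pgid 2.699s0 --op export --file /mnt/pg2.699s0.export

# On a healthy (stopped) target OSD: import the shard, then start the OSD again
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --op import --file /mnt/pg2.699s0.export
ceph orch daemon start osd.<id>
```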

The strange thing is that the OSDs' underlying disks seem fine, in that the
Ceph volume group/LVM is intact.  From what I can tell there are no issues
with the disks and no other indication of disk problems (the last three
disks that were outed were part of a successful recovery/backfill with no
issues).  I tested the two originally outed OSDs: the underlying disks all
pass smartctl short tests, and I have seen no indication of hardware issues
with these spinning disks.

Looking over the Ceph documentation, it looks like there is a newer,
cephadm/BlueStore-friendly tool, ceph-bluestore-tool[3], that does the
same job as ceph-objectstore-tool?
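For what it's worth, the operations documented in [3] are device-level checks on a single (stopped) OSD's BlueStore volume rather than PG export/import; typical invocations look like this (the data path is a placeholder):

```
# Consistency check and label inspection of a stopped OSD's BlueStore store
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-38
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-38
```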

We are hoping there is a way to recover this PG and would appreciate any
advice.  Our production cluster is currently down and we are looking for
recovery avenues.

- Nick

Links:
[1] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure
[2] https://www.croit.io/blog/how-to-recover-inactive-pgs-using-ceph-objectstore-tool-on-ceph-clusters
[3] https://docs.ceph.com/en/reef/man/8/ceph-bluestore-tool/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

