Re: PG down, due to 3 OSD failing

The PGs are stale, down, inactive *because* the OSDs don't start.
Your main effort should be to bring the OSDs up, without purging or
zapping or anything like that.
(Currently your cluster is down, but there is hope of recovering it.
If you start purging things, that can result in permanent data loss.)

More below.

On Fri, Apr 1, 2022 at 9:38 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Ciao Dan,
>      thanks for your time!
>
> So you are suggesting that my problems with PG 85.25 may somehow resolve
> if I manage to bring up the three OSDs currently "down" (possibly due to
> PG 85.12, and other PGs)?

Yes, that's exactly what I'm suggesting.

> Looking for the string 'start interval does not contain the required
> bound' I found similar errors in the three OSDs:
> osd.158: 85.12s0
> osd.145: 85.33s0
> osd.121: 85.11s0

Is that log also for PG 85.12 on the other OSDs?
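(Assuming non-containerized OSDs logging to the default /var/log/ceph/
location, something like the following should show which PG each
crashing OSD is complaining about; adjust the paths for your setup:)

  grep 'start interval does not contain the required bound' \
      /var/log/ceph/ceph-osd.{158,145,121}.log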

> Here is the output of "pg 85.12 query":
>         https://pastebin.ubuntu.com/p/ww3JdwDXVd/
>   and its status (also showing the other 85.XX, for reference):

This is very weird:

    "up": [
        2147483647,
        2147483647,
        2147483647,
        2147483647,
        2147483647
    ],
    "acting": [
        67,
        91,
        82,
        2147483647,
        112
    ],

1. Right now, do the following:
  ceph osd set norebalance
That will prevent PGs from moving from one OSD to another *unless*
they are degraded.
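You can confirm the flag is set (and, much later, clear it once
everything is healthy again) with the usual commands:

  ceph osd dump | grep flags     # should now list 'norebalance'
  # only once recovery is fully done:
  ceph osd unset norebalance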

2. My theory about what happened here: your crush rule change "osd ->
host" below basically asked for all PGs to be moved.
Some glitch happened and broken parts of PG 85.12 ended up on some
OSDs, and those broken parts are now causing those OSDs to crash.
85.12 is "fine" (I mean active) right now because there are enough
complete parts of it on other OSDs.
The fact that "up" above lists '2147483647' (Ceph's placeholder for
"no OSD found") in every slot means your new crush rule is currently
broken. Let's deal with fixing that later.
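(If you want to look at the rule in the meantime, these two commands
will show which rule the pool uses and what it contains; the rule name
is whatever the first command returns:)

  ceph osd pool get csd-dataonly-ec-pool crush_rule
  ceph osd crush rule dump <rule-name>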

3. Question -- what is the output of `ceph osd pool ls detail | grep
csd-dataonly-ec-pool`? If you have `min_size 3` there, then that is
part of the root cause of the outage here. At the end of this thread,
*only after everything is recovered and no PGs are
undersized/degraded*, you will need to raise it with `ceph osd pool
set csd-dataonly-ec-pool min_size 4`.
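For reference, the line you are looking for will look roughly like the
following (the numbers here are made up; only the `min_size` value
matters):

  pool 85 'csd-dataonly-ec-pool' erasure ... size 5 min_size 3 crush_rule ...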

4. The immediate goal should be to try to get osd.158 to start up, by
"removing" the corrupted part of PG 85.12 from it; see the sketch
below. If we can get osd.158 started, then the same approach should
work for the other OSDs.
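Roughly, that would look something like this with
ceph-objectstore-tool, run while osd.158 is stopped (the data path
assumes the default /var/lib/ceph/osd layout; adjust for your setup,
and keep the export file as a safety copy until everything is
recovered):

  systemctl stop ceph-osd@158
  # first export the broken shard somewhere safe
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-158 \
      --pgid 85.12s0 --op export --file /root/85.12s0.export
  # then remove it from the OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-158 \
      --pgid 85.12s0 --op remove --force
  systemctl start ceph-osd@158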
