Re: PG down, due to 3 OSD failing

We're on the right track!

On Fri, Apr 1, 2022 at 6:57 PM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Ciao Dan, thanks for your messages!
>
> On 4/1/22 11:25, Dan van der Ster wrote:
> > The PGs are stale, down, inactive *because* the OSDs don't start.
> > Your main efforts should be to bring OSDs up, without purging or
> > zapping or anyting like that.
> > (Currently your cluster is down, but there are hopes to recover. If
> > you start purging things that can result in permanent data loss.).
>
> Sure, will not do anything like purge/whatever, as long as I can abuse
> your patience...
>
>
> >> Looking for the string 'start interval does not contain the required
> >> bound' I found similar errors in the three OSDs:
> >> osd.158: 85.12s0
> >> osd.145: 85.33s0
> >> osd.121: 85.11s0
> >
> > Is that log also for PG 85.12 on the other OSDs?
>
> Not sure I am getting your point here, sorry. I grep'ed that string in
> the above logs, and only found the occurrences I mentioned. To be
> specific, reference to 85.12 was found only on osd.158 and not on the
> other 'down' OSDs.
>

Sorry, my question was confusing, because I didn't notice you had already
mentioned which PG shards are implicated in each OSD's crash.
Just ignore the question. More below...

> >> Here is the output of "pg 85.12 query":
> >>          https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>    and its status (also showing the other 85.XX, for reference):
> >
> > This is very weird:
> >
> >      "up": [
> >          2147483647,
> >          2147483647,
> >          2147483647,
> >          2147483647,
> >          2147483647
> >      ],
> >      "acting": [
> >          67,
> >          91,
> >          82,
> >          2147483647,
> >          112
> >      ],
> >
> > Right now, do the following:
> >    ceph osd set norebalance
> > That will prevent PGs moving from one OSD to another *unless* they are degraded.
>
> Done

Great, keep it like that for a while, until we understand the "crush"
issue, which is separate from the OSD crashing issue.

>
> > 2. My theory about what happened here. Your crush rule change "osd ->
> > host" below basically asked all PGs to be moved.
> > Some glitch happened and some broken parts of PG 85.12 ended up on
> > some OSDs, now causing those OSDs to crash.
> > 85.12 is "fine", I mean active, now because there are enough complete
> > parts of it on other osds.
> > The fact that "up" above is listing '2147483647' for every osd means
> > your new crush rule is currently broken. Let's deal with fixing that
> > later.
>
> Hmm, in theory, it looks correct, but I see your point and in fact I am
> stuck with some 1-3% fraction of the objects misplaced/degraded, all of
> them in pool 85
>
<snip>

PGs are active if at least 3 shards are up.
Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
assuming 85.25 remains the one and only PG which is down?)
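
A quick, read-only way to keep an eye on that PG (this just prints the
mapping computed from the osdmap):

ceph pg map 85.25

It should show the up and acting OSD sets for 85.25 on a single line.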

> > 3. Question -- what is the output of `ceph osd pool ls detail | grep
> > csd-dataonly-ec-pool` ? If you have `min_size 3` there, then this is
> > part of the root cause of the outage here. At the end of this thread,
> > *only after everything is recovered and no PGs are
> > undersized/degraded* , you will need to set it `ceph osd pool set
> > csd-dataonly-ec-pool min_size 4`
>
> Indeed, it's 3. Connected to your last point below (never mess with
> crush rules if there is anything ongoing), during rebalancing there was
> something which was stuck and I think "health detail" was suggesting
> that reducing min_size would help. I took note of the pools for which I
> updated the parameter, and will go back to the proper values once the
> situation is clean.
>
> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> last_change 616460 flags
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> application rbd

Yup, okay, we need to fix that later so this cluster is configured
correctly. To be followed up.
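
For reference, the fix itself is the one command already quoted above, to
be run only after every PG is active+clean and nothing is
undersized/degraded:

ceph osd pool set csd-dataonly-ec-pool min_size 4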

>
> > 4. The immediate goal should be to try to get osd.158 to start up, by
> > "removing" the corrupted part of PG 85.12 from it.
> > IF we can get osd.158 started, then the same approach should work for
> > the other OSDs.
> >  From your previous log, osd.158 has a broken piece of pg 85.12. Let's
> > export-remove it:
> >
> > ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
> > --op export-remove --pgid 85.12s0 > osd.158-85.12s0.bin
> >
> > Please do that, then try to start osd.158, and report back here.
>
> Did that, and osd.158 is now UP, thanks! I think the output of "ceph -s"
> did not change but that's a consequence of norebalance, I guess.

Nope -- don't touch norebalance yet, it's not relevant here.
You should check the status of pg 85.25 now -- `ceph pg 85.25 query`.
I hope it now mentions osd.158 as active for one of the shards?

> If I understand correctly, it should now be safe (but I will wait for
> your green light) to repeat the same for:
> osd.121 chunk 85.11s0
> osd.145 chunk 85.33s0
>   so they can also start.

Yes, please go ahead and do the same.
I expect that your PG 85.25 will go active as soon as both those OSDs
start correctly.
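
Assuming the same data-path naming as for osd.158 (please double-check the
paths and pgids on each host first, and keep the exported .bin files as
backups), that would look roughly like:

# on the host with osd.121 (daemon stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-121/ --op export-remove --pgid 85.11s0 > osd.121-85.11s0.bin

# on the host with osd.145 (daemon stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-145/ --op export-remove --pgid 85.33s0 > osd.145-85.33s0.bin

Then try starting those OSDs again.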

BTW, I also noticed in your crush map below that the down OSDs have
crush weight zero!
So these are the only OSDs holding data for some PGs, and yet they are
all set to be drained.
How did this happen? It is surely also part of the root cause here!

I suggest resetting the crush weight of those OSDs back to what it was
before, probably 1?
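
For each of the down OSDs that would be something like the following (the
1.0 is just a placeholder, use whatever the weight was before the drain):

ceph osd crush reweight osd.121 1.0
ceph osd crush reweight osd.145 1.0
ceph osd crush reweight osd.158 1.0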

> And once started, I can clear the
> "norebalance" flag, correct?

No! Not at all.

After you have all the PGs active, we need to find out why their "up"
set is completely bogus.
This is evidence that your crush rule is broken.
If a PG doesn't have a complete "up" set, then it can never get out of
the degraded state -- the PGs don't know where to go.

I'm curious about that "storage" type you guys invented.

Could you please decompile your crush map and share the resulting
crush.txt via pastebin:

ceph osd getcrushmap -o crush.map
crushtool -d crush.map -o crush.txt
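
If you like, you can also test the rule offline against that map, e.g. for
rule id 5 (csd-data-pool) with 5 shards:

crushtool -i crush.map --test --rule 5 --num-rep 5 --show-bad-mappings

Any "bad mapping" output there means crush cannot find 5 distinct hosts
for some PGs, which would explain the bogus "up" sets.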


>
> > Two more questions below...
> >
> >>
> >> 85.11    39501        0         0       0 165479411712           0
> >>       0 3000                  stale+active+clean    3d    606021'532631
> >>     617659:1827554
> >> [124,157,68,72,102]p124
> >> [124,157,68,72,102]p124 2022-03-28 07:21:00.566032 2022-03-28
> >> 07:21:00.566032
> >> 85.12    39704    39704    158816       0 166350008320           0
> >>       0 3028 active+undersized+degraded+remapped    3d    606021'573200
> >>     620336:1839924
> >> [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
> >>              [67,91,82,2147483647,112]p67 2022-03-15 03:25:28.478280
> >> 2022-03-12 19:10:45.866650
> >> 85.25    39402        0         0       0 165108592640           0
> >>       0 3098                 stale+down+remapped    3d    606021'521273
> >>     618930:1734492
> >> [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
> >> [2147483647,2147483647,96,2147483647,2147483647]p96 2022-03-15
> >> 04:08:42.561720 2022-03-09 17:05:34.205121
> >> 85.33    39319        0         0       0 164740796416           0
> >>       0 3000                  stale+active+clean    3d    606021'513259
> >>     617659:2125167
> >> [174,112,85,102,124]p174
> >> [174,112,85,102,124]p174 2022-03-28 07:21:12.097873 2022-03-28
> >> 07:21:12.097873
> >>
> >> So 85.11 and 85.33 do not look bad, after all: why are the relevant OSDs
> >> complaining? Is there a way to force them (OSDs) to forget about the
> >> chunks they possess, as apparently those have already safely migrated
> >> elsewhere?
> >>
> >> Indeed 85.12 is not really healthy...
> >> As for chunks of 85.12 and 85.25, the 3 down OSDs have:
> >> osd.121
> >>          85.12s3
> >>          85.25s3
> >> osd.158
> >>          85.12s0
> >> osd.145
> >>          none
> >> I guess I can safely purge osd.145 and re-create it, then.
> >
> > No!!! It contains crucial data for *other* PGs!
>
> Ok! :-)
>
> >> As for the history of the pool, this is an EC pool with metadata in a
> >> SSD-backed replicated pool. At some point I realized I had made a
> >> mistake in the allocation rule for the "data" part, so I changed the
> >> relevant rule to:
> >>
> >> ~]$ ceph --cluster cephpa1 osd lspools | grep 85
> >> 85 csd-dataonly-ec-pool
> >> ~]$ ceph --cluster cephpa1 osd pool get csd-dataonly-ec-pool crush_rule
> >> crush_rule: csd-data-pool
> >>
> >> rule csd-data-pool {
> >>           id 5
> >>           type erasure
> >>           min_size 3
> >>           max_size 5
> >>           step set_chooseleaf_tries 5
> >>           step set_choose_tries 100
> >>           step take default class big
> >>           step choose indep 0 type host  <--- this was "osd", before
> >>           step emit
> >> }
> >
> > Can you please share the output of `ceph osd tree` ?
> >
> > We need to understand why crush is not working any more for your pool.
>
> Sure! Here it is. For historical reasons there are buckets of type
> "storage", which you can safely ignore as they are no longer present
> in any crush_rule.

I think they may be relevant, as mentioned earlier.

> Please also don't worry about the funny weights, as I am preparing for
> hardware replacement and am freeing up space.

As a general rule, never drain OSDs (i.e. never decrease their crush
weight) when any PG is degraded.
You risk deleting the last copy of a PG!
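
A quick sanity check before any reweight/drain, for example:

ceph health detail
ceph pg dump pgs_brief | grep -v 'active+clean'

If health isn't OK or the grep returns any PGs, hold off on draining.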

<snip>
>
> >> At the time I changed the rule, there was no 'down' PG, all PGs in the
> >> cluster were 'active' plus possibly some other state (remapped,
> >> degraded, whatever) as I had added some new disk servers few days before.
> >
> > Never make crush rule changes when any PG is degraded, remapped, or whatever!
> > They must all be active+clean to consider big changes like injecting a
> > new crush rule!!
>
> Ok, now I think I learned it. In my mind it was a sort of optimization:
> as I was moving stuff around due to the additional servers, why not at
> the same time update the crush rule?
> Will remember the lesson for the future.

Yup, one thing at a time ;)


Thanks!

-- dan


>
>    Thanks!
>
>                         Fulvio
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> tel.: +39-334-6533-250
> skype: fgaleazzi70
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


