Re: PG down, due to 3 OSD failing

Hi Fulvio,

Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
I suggest fixing it, injecting the new crush map, and seeing how it goes.
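
Something along these lines should do it -- just a sketch, so double-check
the file names, and that rule id 5 / --num-rep 5 really match your pool,
before injecting anything:

  ceph osd getcrushmap -o crush.map
  crushtool -d crush.map -o crush.txt
  # edit crush.txt: "step choose indep 0 type host"
  #            ->   "step chooseleaf indep 0 type host"
  crushtool -c crush.txt -o crush.new
  crushtool -i crush.new --test --rule 5 --num-rep 5 --show-bad-mappings
  ceph osd setcrushmap -i crush.new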


Next, there is an error in your crush map related to the "storage" type:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 storage

The *order* of types is very important in crush -- they must be listed in
the same order in which they nest in the tree. "storage" should therefore
sit somewhere between osd and host, not after root.
If it stays where it is and you actually use that type, it can break things.
But since you're not actually using "storage" at the moment, it
probably isn't causing any issue.
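
If you ever do want to use a bucket level in that position, the type list
would need to look more like this (just a sketch -- renumbering the types
means editing the decompiled map and re-injecting it, so do it deliberately):

# types
type 0 osd
type 1 storage
type 2 host
type 3 chassis
type 4 rack
type 5 row
type 6 pdu
type 7 pod
type 8 room
type 9 datacenter
type 10 region
type 11 root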

So -- could you go ahead with that chooseleaf fix, then let us know how it goes?
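
(Once the new map is in, a quick "ceph pg map 85.25" should show a fully
populated "up" set instead of all those 2147483647 placeholders.)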

Cheers, Dan





On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Hi again Dan!
> Things are improving: all OSDs are up, but that one PG is still down.
> More info below.
>
> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>> Here is the output of "pg 85.12 query":
> >>>>           https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>>     and its status (also showing the other 85.XX, for reference):
> >>>
> >>> This is very weird:
> >>>
> >>>       "up": [
> >>>           2147483647,
> >>>           2147483647,
> >>>           2147483647,
> >>>           2147483647,
> >>>           2147483647
> >>>       ],
> >>>       "acting": [
> >>>           67,
> >>>           91,
> >>>           82,
> >>>           2147483647,
> >>>           112
> >>>       ],
>
> Meanwhile, since a random PG still shows output like the above, I
> think I found the problem with the crush rule: it says "choose" rather
> than "chooseleaf"!
>
> rule csd-data-pool {
>          id 5
>          type erasure
>          min_size 3
>          max_size 5
>          step set_chooseleaf_tries 5
>          step set_choose_tries 100
>          step take default class big
>          step choose indep 0 type host    <--- HERE!
>          step emit
> }
>
> ...a relic of a more complicated, two-step rule... sigh!
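>
> So, unless I am mistaken, the fix is simply to turn that line into:
>
>          step chooseleaf indep 0 type host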
>
> > PGs are active if at least 3 shards are up.
> > Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > assuming 85.25 remains the one and only PG which is down?)
>
> Yes, 85.25 is still the single 'down' PG.
>
> >> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >> last_change 616460 flags
> >> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >> application rbd
> >
> > Yup okay, we need to fix that later to make this cluster correctly
> > configured. To be followed up.
>
> At some point, I will need to update min_size to 4.
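> (If I am not mistaken, that is just a matter of:
>          ceph osd pool set csd-dataonly-ec-pool min_size 4
> once the PGs are healthy again.)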
>
> >> If I understand correctly, it should now be safe (but I will wait for
> >> your green light) to repeat the same for:
> >> osd.121 chunk 85.11s0
> >> osd.145 chunk 85.33s0
> >>    so they can also start.
> >
> > Yes, please go ahead and do the same.
> > I expect that your PG 85.25 will go active as soon as both those OSDs
> > start correctly.
>
> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> Its chunks are in:
>
> 85.25s0: osd.64
> 85.25s1: osd.140 osd.159
> 85.25s2: osd.96
> 85.25s3: osd.121 osd.176
> 85.25s4: osd.159 osd.56
>
> > BTW, I also noticed in your crush map below that the down osds have
> > crush weight zero!
> > So -- this means they are the only active OSDs for a PG, and they are
> > all set to be drained.
> > How did this happen? It is also surely part of the root cause here!
> >
> > I suggest resetting the crush weight of those back to what it was
> > before, probably 1?
>
> At some point I changed those weights to 0, but that was well after the
> problem began: it did help, at least, in healing a lot of the
> degraded/undersized PGs.
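>
> If I understand correctly, putting them back should just be, e.g. for
> osd.121 (and similarly for the others, with whatever their original
> weight was -- probably 1, as you say):
>
>          ceph osd crush reweight osd.121 1.0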
>
> > After you have all the PGs active, we need to find out why their "up"
> > set is completely bogus.
> > This is evidence that your crush rule is broken.
> > If a PG doesn't have a complete "up" set, then it can never stop being
> > degraded -- the PGs don't know where to go.
>
> Do you think the choose-vs-chooseleaf issue mentioned above could be
> the culprit?
>
> > I'm curious about that "storage" type you guys invented.
>
> Oh, nothing too fancy... by way of background, we happen to be using
> (and are currently, finally, replacing) Fibre Channel SAN-based hardware,
> which is not the first choice in the Ceph world: but the purchase happened
> before we turned to Ceph as our storage solution. Each OSD server has
> access to 2 such distinct storage systems, hence the idea of describing
> these failure domains in the crush rule.
>
> > Could you please copy to pastebin and share the crush.txt from
> >
> > ceph osd getcrushmap -o crush.map
> > crushtool -d crush.map -o crush.txt
>
> Here it is:
>         https://pastebin.ubuntu.com/p/THkcT6xNgC/
>
> >> Sure! Here it is. For historical reasons there are buckets of type
> >> "storage", which you can however safely ignore, as they are no longer
> >> present in any crush_rule.
> >
> > I think they may be relevant, as mentioned earlier.
> >
> >> Please also don't worry about the funny weights, as I am preparing for
> >> hardware replacement and am freeing up space.
> >
> > As a general rule, never drain osds (never decrease their crush
> > weight) when any PG is degraded.
> > You risk deleting the last copy of a PG!
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> tel.: +39-334-6533-250
> skype: fgaleazzi70


