Excellent news! After everything is back to active+clean, don't forget
to set min_size to 4 :)

Have a nice day

On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Yesss! Fixing the choose/chooseleaf thing did make the magic. :-)
>
> Thanks a lot for your support, Dan. Lots of lessons learned from my
> side, I'm really grateful.
>
> All PGs are now active; I will let Ceph rebalance.
>
> Ciao ciao
>
>                     Fulvio
>
> On 4/4/22 10:50, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
> > I suggest fixing it, injecting the new crush map, and seeing how it goes.
> >
> >
> > Next, in your crush map for the storage type, you have an error:
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 region
> > type 10 root
> > type 11 storage
> >
> > The *order* of types is very important in crush -- they must be nested
> > in the order they appear in the tree. "storage" should therefore be
> > something between host and osd.
> > If not, and if you use that type, it can break things.
> > But since you're not actually using "storage" at the moment, it
> > probably isn't causing any issue.
> >
> > So -- could you go ahead with that chooseleaf fix and then let us know how it goes?
> >
> > Cheers, Dan
> >
> >
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> >>
> >> Hi again Dan!
> >> Things are improving, all OSDs are up, but still that one PG is down.
> >> More info below.
> >>
> >> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>>>> Here is the output of "pg 85.12 query":
> >>>>>> https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>>>> and its status (also showing the other 85.XX, for reference):
> >>>>>
> >>>>> This is very weird:
> >>>>>
> >>>>>      "up": [
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647
> >>>>>      ],
> >>>>>      "acting": [
> >>>>>          67,
> >>>>>          91,
> >>>>>          82,
> >>>>>          2147483647,
> >>>>>          112
> >>>>>      ],
> >>
> >> Meanwhile, since a random PG still shows output like the one above, I
> >> think I found the problem with the crush rule: it says "choose" rather
> >> than "chooseleaf"!
> >>
> >> rule csd-data-pool {
> >>     id 5
> >>     type erasure
> >>     min_size 3
> >>     max_size 5
> >>     step set_chooseleaf_tries 5
> >>     step set_choose_tries 100
> >>     step take default class big
> >>     step choose indep 0 type host      <--- HERE!
> >>     step emit
> >> }
> >>
> >> ...a relic of a more complicated, two-step rule... sigh!
> >>
> >>> PGs are active if at least 3 shards are up.
> >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> >>> assuming 85.25 remains the one and only PG which is down?)
> >>
> >> Yes, 85.25 is still the single 'down' PG.
> >>
> >>>> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >>>> last_change 616460 flags
> >>>> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >>>> application rbd
> >>>
> >>> Yup okay, we need to fix that later to make this cluster correctly
> >>> configured. To be followed up.
> >>
> >> At some point, we need to update min_size to 4.
> >>
> >>>> If I understand correctly, it should now be safe (but I will wait for
> >>>> your green light) to repeat the same for:
> >>>>     osd.121   chunk 85.11s0
> >>>>     osd.145   chunk 85.33s0
> >>>> so they can also start.
> >>>
> >>> Yes, please go ahead and do the same.
> >>> I expect that your PG 85.25 will go active as soon as both those OSDs
> >>> start correctly.
> >>
> >> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> >> Its chunks are in:
> >>
> >>     85.25s0: osd.64
> >>     85.25s1: osd.140  osd.159
> >>     85.25s2: osd.96
> >>     85.25s3: osd.121  osd.176
> >>     85.25s4: osd.159  osd.56
> >>
> >>> BTW, I also noticed in your crush map below that the down OSDs have
> >>> crush weight zero!
> >>> So -- this means they are the only active OSDs for a PG, and they are
> >>> all set to be drained.
> >>> How did this happen? It is also surely part of the root cause here!
> >>>
> >>> I suggest resetting the crush weight of those back to what it was
> >>> before, probably 1?
> >>
> >> At some point I changed those weights to 0, but this was well after the
> >> beginning of the problem: it helped, at least, to heal a lot of the
> >> degraded/undersized PGs.
> >>
> >>> After you have all the PGs active, we need to find out why their "up"
> >>> set is completely bogus.
> >>> This is evidence that your crush rule is broken.
> >>> If a PG doesn't have a complete "up" set, then it can never stop being
> >>> degraded -- the PGs don't know where to go.
> >>
> >> Do you think the choose/chooseleaf issue mentioned above could be the
> >> culprit?
> >>
> >>> I'm curious about that "storage" type you guys invented.
> >>
> >> Oh, nothing too fancy... a foreword: we happen to be using (and are
> >> currently finally replacing) hardware (based on FiberChannel-SAN) which
> >> is not the first choice in the Ceph world, but the purchase happened before
> >> we turned to Ceph as our storage solution. Each OSD server has access to
> >> 2 such distinct storage systems, hence the idea to describe these
> >> failure domains in the crush rule.
> >>
> >>> Could you please copy to pastebin and share the crush.txt from
> >>>
> >>>      ceph osd getcrushmap -o crush.map
> >>>      crushtool -d crush.map -o crush.txt
> >>
> >> Here it is:
> >>      https://pastebin.ubuntu.com/p/THkcT6xNgC/
> >>
> >>>> Sure! Here it is. For historical reasons there are buckets of type
> >>>> "storage" which however you can safely ignore as they are no longer
> >>>> present in any crush_rule.
> >>>
> >>> I think they may be relevant, as mentioned earlier.
> >>>
> >>>> Please also don't worry about the funny weights, as I am preparing for
> >>>> hardware replacement and am freeing up space.
> >>>
> >>> As a general rule, never drain OSDs (never decrease their crush
> >>> weight) when any PG is degraded.
> >>> You risk deleting the last copy of a PG!
> >>
> >> --
> >> Fulvio Galeazzi
> >> GARR-CSD Department
> >> tel.:   +39-334-6533-250
> >> skype:  fgaleazzi70
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> tel.:   +39-334-6533-250
> skype:  fgaleazzi70

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
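
For reference, the fix agreed on in this thread boils down to a short
sequence of commands. The following is only a sketch: it assumes the rule
and pool names quoted above (csd-data-pool, csd-dataonly-ec-pool), uses
osd.<id> as a placeholder for whichever OSDs had their crush weight set to
0, and 1.0 as an example weight; adjust both to match the actual cluster.

     # Dump and decompile the current crush map
     ceph osd getcrushmap -o crush.map
     crushtool -d crush.map -o crush.txt

     # In crush.txt, inside rule csd-data-pool, change
     #     step choose indep 0 type host
     # to
     #     step chooseleaf indep 0 type host

     # Recompile and inject the fixed map
     crushtool -c crush.txt -o crush.new
     ceph osd setcrushmap -i crush.new

     # Restore the crush weight of any OSD that was drained to 0 while
     # still holding needed shards (osd.<id> and 1.0 are placeholders)
     ceph osd crush reweight osd.<id> 1.0

     # Once everything is back to active+clean, raise min_size to k+1
     ceph osd pool set csd-dataonly-ec-pool min_size 4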