BTW -- I've created https://tracker.ceph.com/issues/55169 to ask that we
add some input validation. Injecting such a crush map would ideally not
be possible.

-- dan

On Mon, Apr 4, 2022 at 11:02 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> Excellent news!
> After everything is back to active+clean, don't forget to set min_size to 4 :)
>
> have a nice day
>
> On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> >
> > Yesss! Fixing the choose/chooseleaf thing did the magic. :-)
> >
> > Thanks a lot for your support Dan. Lots of lessons learned on my
> > side, I'm really grateful.
> >
> > All PGs are now active, will let Ceph rebalance.
> >
> > Ciao ciao
> >
> >                    Fulvio
> >
> > On 4/4/22 10:50, Dan van der Ster wrote:
> > > Hi Fulvio,
> > >
> > > Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
> > > I suggest fixing it, injecting the new crush map, and seeing how it goes.
> > >
> > > Next, in your crush map you have an error in the "storage" type:
> > >
> > >    # types
> > >    type 0 osd
> > >    type 1 host
> > >    type 2 chassis
> > >    type 3 rack
> > >    type 4 row
> > >    type 5 pdu
> > >    type 6 pod
> > >    type 7 room
> > >    type 8 datacenter
> > >    type 9 region
> > >    type 10 root
> > >    type 11 storage
> > >
> > > The *order* of types is very important in crush -- they must be nested
> > > in the order they appear in the tree. "storage" should therefore be
> > > something between host and osd.
> > > If not, and if you use that type, it can break things.
> > > But since you're not actually using "storage" at the moment, it
> > > probably isn't causing any issue.
> > >
> > > So -- could you go ahead with that chooseleaf fix and then let us know how it goes?
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> > >>
> > >> Hi again Dan!
> > >> Things are improving, all OSDs are up, but still that one PG is down.
> > >> More info below.
> > >>
> > >> On 4/1/22 19:26, Dan van der Ster wrote:
> > >>>>>> Here is the output of "pg 85.12 query":
> > >>>>>>     https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> > >>>>>> and its status (also showing the other 85.XX, for reference):
> > >>>>>
> > >>>>> This is very weird:
> > >>>>>
> > >>>>>      "up": [
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647
> > >>>>>      ],
> > >>>>>      "acting": [
> > >>>>>          67,
> > >>>>>          91,
> > >>>>>          82,
> > >>>>>          2147483647,
> > >>>>>          112
> > >>>>>      ],
> > >>
> > >> Meanwhile, since a random PG still shows an output like the above one, I
> > >> think I found the problem with the crush rule: it says "choose" rather
> > >> than "chooseleaf"!
> > >>
> > >>    rule csd-data-pool {
> > >>        id 5
> > >>        type erasure
> > >>        min_size 3
> > >>        max_size 5
> > >>        step set_chooseleaf_tries 5
> > >>        step set_choose_tries 100
> > >>        step take default class big
> > >>        step choose indep 0 type host      <--- HERE!
> > >>        step emit
> > >>    }
> > >>
> > >> ...a relic of a more complicated, two-step rule... sigh!
> > >>
> > >>> PGs are active if at least 3 shards are up.
> > >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > >>> assuming 85.25 remains the one and only PG which is down?)
> > >>
> > >> Yes, 85.25 is still the single 'down' PG.
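> > >>
> > >> If that really is the culprit, the fix should presumably be nothing more
> > >> than editing that one step ("choose" -> "chooseleaf") and re-injecting
> > >> the map, along these lines (crush.new is just my local scratch file
> > >> name):
> > >>
> > >>    ceph osd getcrushmap -o crush.map
> > >>    crushtool -d crush.map -o crush.txt
> > >>    # edit: step choose indep 0 type host
> > >>    #  into: step chooseleaf indep 0 type host
> > >>    crushtool -c crush.txt -o crush.new
> > >>    ceph osd setcrushmap -i crush.new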
> > >>
> > >>>> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> > >>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> > >>>> last_change 616460 flags
> > >>>> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> > >>>> application rbd
> > >>>
> > >>> Yup okay, we need to fix that later to make this cluster correctly
> > >>> configured. To be followed up.
> > >>
> > >> At some point, I need to update min_size to 4.
> > >>
> > >>>> If I understand correctly, it should now be safe (but I will wait for
> > >>>> your green light) to repeat the same for:
> > >>>>    osd.121   chunk 85.11s0
> > >>>>    osd.145   chunk 85.33s0
> > >>>> so they can also start.
> > >>>
> > >>> Yes, please go ahead and do the same.
> > >>> I expect that your PG 85.25 will go active as soon as both those OSDs
> > >>> start correctly.
> > >>
> > >> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> > >> Its chunks are in:
> > >>
> > >>    85.25s0: osd.64
> > >>    85.25s1: osd.140  osd.159
> > >>    85.25s2: osd.96
> > >>    85.25s3: osd.121  osd.176
> > >>    85.25s4: osd.159  osd.56
> > >>
> > >>> BTW, I also noticed in your crush map below that the down osds have
> > >>> crush weight zero!
> > >>> So -- this means they are the only active OSDs for a PG, and they are
> > >>> all set to be drained.
> > >>> How did this happen? It is also surely part of the root cause here!
> > >>>
> > >>> I suggest resetting the crush weight of those back to what it was
> > >>> before, probably 1?
> > >>
> > >> At some point I changed those weights to 0, but this was well after the
> > >> beginning of the problem: it did help, at least, heal a lot of the
> > >> degraded/undersized PGs.
> > >>
> > >>> After you have all the PGs active, we need to find out why their "up"
> > >>> set is completely bogus.
> > >>> This is evidence that your crush rule is broken.
> > >>> If a PG doesn't have a complete "up" set, then it can never get out of
> > >>> the degraded state -- the PGs don't know where to go.
> > >>
> > >> Do you think the choose-vs-chooseleaf issue mentioned above could be the
> > >> culprit?
> > >>
> > >>> I'm curious about that "storage" type you guys invented.
> > >>
> > >> Oh, nothing too fancy... as a foreword: we happen to be using (and are
> > >> currently, finally, replacing) FibreChannel-SAN-based hardware, which
> > >> is not the first choice in the Ceph world, but the purchase happened
> > >> before we turned to Ceph as our storage solution. Each OSD server has
> > >> access to two such distinct storage systems, hence the idea of
> > >> describing these failure domains in the crush rule.
> > >>
> > >>> Could you please copy to pastebin and share the crush.txt from
> > >>>
> > >>>    ceph osd getcrushmap -o crush.map
> > >>>    crushtool -d crush.map -o crush.txt
> > >>
> > >> Here it is:
> > >>    https://pastebin.ubuntu.com/p/THkcT6xNgC/
> > >>
> > >>>> Sure! Here it is. For historical reasons there are buckets of type
> > >>>> "storage", which you can safely ignore as they are no longer
> > >>>> present in any crush rule.
> > >>>
> > >>> I think they may be relevant, as mentioned earlier.
> > >>>
> > >>>> Please also don't worry about the funny weights, as I am preparing for
> > >>>> hardware replacement and am freeing up space.
> > >>>
> > >>> As a general rule, never drain OSDs (never decrease their crush
> > >>> weight) while any PG is degraded.
> > >>> You risk deleting the last copy of a PG!
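> > >>
> > >> Understood. Once everything is back to active+clean I plan to raise
> > >> min_size on the pool and put the crush weights back; unless I am
> > >> misreading the docs, that should just be the following (1.0 is the
> > >> assumed original weight, per your guess above, repeated for each OSD
> > >> I had set to 0):
> > >>
> > >>    ceph osd pool set csd-dataonly-ec-pool min_size 4
> > >>    ceph osd crush reweight osd.<id> 1.0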
> > >>
> > >> --
> > >> Fulvio Galeazzi
> > >> GARR-CSD Department
> > >> tel.: +39-334-6533-250
> > >> skype: fgaleazzi70
> >
> > --
> > Fulvio Galeazzi
> > GARR-CSD Department
> > tel.: +39-334-6533-250
> > skype: fgaleazzi70
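
PS -- until that validation exists, a candidate map can at least be
sanity-checked before injection with crushtool's test mode. A minimal
sketch (rule id 5 and --num-rep 5 match the EC pool discussed in this
thread; crush.new is just an example file name), which may well not catch
every bad rule -- hence the tracker above:

   crushtool -c crush.txt -o crush.new
   crushtool -i crush.new --test --rule 5 --num-rep 5 --show-bad-mappings

--show-bad-mappings prints every input for which the rule cannot produce
the requested number of OSDs; a clean run prints nothing.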