BTW -- I've created https://tracker.ceph.com/issues/55169 to ask that we
add some input validation. Injecting such a crush map would ideally not
be possible.

-- dan

On Mon, Apr 4, 2022 at 11:02 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> Excellent news!
> After everything is back to active+clean, don't forget to set min_size to 4 :)
>
> have a nice day
>
> On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> >
> > Yesss! Fixing the choose/chooseleaf thing did the magic. :-)
> >
> > Thanks a lot for your support Dan. Lots of lessons learned on my
> > side, I'm really grateful.
> >
> > All PGs are now active, will let Ceph rebalance.
> >
> > Ciao ciao
> >
> >                    Fulvio
> >
> > On 4/4/22 10:50, Dan van der Ster wrote:
> > > Hi Fulvio,
> > >
> > > Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
> > > I suggest fixing it, injecting the new crush map, and seeing how it goes.
> > >
> > > Next, in your crush map you have an error in the "storage" type:
> > >
> > >    # types
> > >    type 0 osd
> > >    type 1 host
> > >    type 2 chassis
> > >    type 3 rack
> > >    type 4 row
> > >    type 5 pdu
> > >    type 6 pod
> > >    type 7 room
> > >    type 8 datacenter
> > >    type 9 region
> > >    type 10 root
> > >    type 11 storage
> > >
> > > The *order* of types is very important in crush -- they must be nested
> > > in the order they appear in the tree. "storage" should therefore be
> > > something between host and osd.
> > > If not, and if you use that type, it can break things.
> > > But since you're not actually using "storage" at the moment, it
> > > probably isn't causing any issue.
> > >
> > > So -- could you go ahead with that chooseleaf fix and then let us know how it goes?
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> > >>
> > >> Hi again Dan!
> > >> Things are improving, all OSDs are up, but still that one PG is down.
> > >> More info below.
> > >>
> > >> On 4/1/22 19:26, Dan van der Ster wrote:
> > >>>>>> Here is the output of "pg 85.12 query":
> > >>>>>>     https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> > >>>>>> and its status (also showing the other 85.XX, for reference):
> > >>>>>
> > >>>>> This is very weird:
> > >>>>>
> > >>>>>      "up": [
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647,
> > >>>>>          2147483647
> > >>>>>      ],
> > >>>>>      "acting": [
> > >>>>>          67,
> > >>>>>          91,
> > >>>>>          82,
> > >>>>>          2147483647,
> > >>>>>          112
> > >>>>>      ],
> > >>
> > >> Meanwhile, since a random PG still shows an output like the above one, I
> > >> think I found the problem with the crush rule: it says "choose" rather
> > >> than "chooseleaf"!
> > >>
> > >>    rule csd-data-pool {
> > >>        id 5
> > >>        type erasure
> > >>        min_size 3
> > >>        max_size 5
> > >>        step set_chooseleaf_tries 5
> > >>        step set_choose_tries 100
> > >>        step take default class big
> > >>        step choose indep 0 type host      <--- HERE!
> > >>        step emit
> > >>    }
> > >>
> > >> ...a relic of a more complicated, two-step rule... sigh!
> > >>
> > >>> PGs are active if at least 3 shards are up.
> > >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > >>> assuming 85.25 remains the one and only PG which is down?)
> > >>
> > >> Yes, 85.25 is still the single 'down' PG.
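> > >>
> > >> If that really is the culprit, the fix should presumably be nothing more
> > >> than editing that one step ("choose" -> "chooseleaf") and re-injecting
> > >> the map, along these lines (crush.new is just my local scratch file
> > >> name):
> > >>
> > >>    ceph osd getcrushmap -o crush.map
> > >>    crushtool -d crush.map -o crush.txt
> > >>    # edit: step choose indep 0 type host
> > >>    #  into: step chooseleaf indep 0 type host
> > >>    crushtool -c crush.txt -o crush.new
> > >>    ceph osd setcrushmap -i crush.new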
> > >>
> > >>>> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> > >>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> > >>>> last_change 616460 flags
> > >>>> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> > >>>> application rbd
> > >>>
> > >>> Yup okay, we need to fix that later to make this cluster correctly
> > >>> configured. To be followed up.
> > >>
> > >> At some point, I need to update min_size to 4.
> > >>
> > >>>> If I understand correctly, it should now be safe (but I will wait for
> > >>>> your green light) to repeat the same for:
> > >>>>    osd.121   chunk 85.11s0
> > >>>>    osd.145   chunk 85.33s0
> > >>>> so they can also start.
> > >>>
> > >>> Yes, please go ahead and do the same.
> > >>> I expect that your PG 85.25 will go active as soon as both those OSDs
> > >>> start correctly.
> > >>
> > >> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> > >> Its chunks are in:
> > >>
> > >>    85.25s0: osd.64
> > >>    85.25s1: osd.140  osd.159
> > >>    85.25s2: osd.96
> > >>    85.25s3: osd.121  osd.176
> > >>    85.25s4: osd.159  osd.56
> > >>
> > >>> BTW, I also noticed in your crush map below that the down osds have
> > >>> crush weight zero!
> > >>> So -- this means they are the only active OSDs for a PG, and they are
> > >>> all set to be drained.
> > >>> How did this happen? It is also surely part of the root cause here!
> > >>>
> > >>> I suggest resetting the crush weight of those back to what it was
> > >>> before, probably 1?
> > >>
> > >> At some point I changed those weights to 0, but this was well after the
> > >> beginning of the problem: it did help, at least, heal a lot of the
> > >> degraded/undersized PGs.
> > >>
> > >>> After you have all the PGs active, we need to find out why their "up"
> > >>> set is completely bogus.
> > >>> This is evidence that your crush rule is broken.
> > >>> If a PG doesn't have a complete "up" set, then it can never get out of
> > >>> the degraded state -- the PGs don't know where to go.
> > >>
> > >> Do you think the choose-vs-chooseleaf issue mentioned above could be the
> > >> culprit?
> > >>
> > >>> I'm curious about that "storage" type you guys invented.
> > >>
> > >> Oh, nothing too fancy... as a foreword: we happen to be using (and are
> > >> currently, finally, replacing) FibreChannel-SAN-based hardware, which
> > >> is not the first choice in the Ceph world, but the purchase happened
> > >> before we turned to Ceph as our storage solution. Each OSD server has
> > >> access to two such distinct storage systems, hence the idea of
> > >> describing these failure domains in the crush rule.
> > >>
> > >>> Could you please copy to pastebin and share the crush.txt from
> > >>>
> > >>>    ceph osd getcrushmap -o crush.map
> > >>>    crushtool -d crush.map -o crush.txt
> > >>
> > >> Here it is:
> > >>    https://pastebin.ubuntu.com/p/THkcT6xNgC/
> > >>
> > >>>> Sure! Here it is. For historical reasons there are buckets of type
> > >>>> "storage", which you can safely ignore as they are no longer
> > >>>> present in any crush rule.
> > >>>
> > >>> I think they may be relevant, as mentioned earlier.
> > >>>
> > >>>> Please also don't worry about the funny weights, as I am preparing for
> > >>>> hardware replacement and am freeing up space.
> > >>>
> > >>> As a general rule, never drain OSDs (never decrease their crush
> > >>> weight) while any PG is degraded.
> > >>> You risk deleting the last copy of a PG!
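> > >>
> > >> Understood. Once everything is back to active+clean I plan to raise
> > >> min_size on the pool and put the crush weights back; unless I am
> > >> misreading the docs, that should just be the following (1.0 is the
> > >> assumed original weight, per your guess above, repeated for each OSD
> > >> I had set to 0):
> > >>
> > >>    ceph osd pool set csd-dataonly-ec-pool min_size 4
> > >>    ceph osd crush reweight osd.<id> 1.0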
> > >>
> > >> --
> > >> Fulvio Galeazzi
> > >> GARR-CSD Department
> > >> tel.: +39-334-6533-250
> > >> skype: fgaleazzi70
> >
> > --
> > Fulvio Galeazzi
> > GARR-CSD Department
> > tel.: +39-334-6533-250
> > skype: fgaleazzi70
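
PS -- until that validation exists, a candidate map can at least be
sanity-checked before injection with crushtool's test mode. A minimal
sketch (rule id 5 and --num-rep 5 match the EC pool discussed in this
thread; crush.new is just an example file name), which may well not catch
every bad rule -- hence the tracker above:

   crushtool -c crush.txt -o crush.new
   crushtool -i crush.new --test --rule 5 --num-rep 5 --show-bad-mappings

--show-bad-mappings prints every input for which the rule cannot produce
the requested number of OSDs; a clean run prints nothing.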