Excellent news! After everything is back to active+clean, don't forget
to set min_size to 4 :)

Have a nice day

On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Yesss! Fixing the choose/chooseleaf thing did make the magic. :-)
>
> Thanks a lot for your support, Dan. Lots of lessons learned from my
> side, I'm really grateful.
>
> All PGs are now active; I will let Ceph rebalance.
>
> Ciao ciao
>
>                     Fulvio
>
> On 4/4/22 10:50, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > Yes -- that choose/chooseleaf thing is definitely a problem. Good catch!
> > I suggest fixing it, injecting the new crush map, and seeing how it goes.
> >
> >
> > Next, in your crush map for the storage type, you have an error:
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 region
> > type 10 root
> > type 11 storage
> >
> > The *order* of types is very important in crush -- they must be nested
> > in the order they appear in the tree. "storage" should therefore be
> > something between host and osd.
> > If not, and if you use that type, it can break things.
> > But since you're not actually using "storage" at the moment, it
> > probably isn't causing any issue.
> >
> > So -- could you go ahead with that chooseleaf fix and then let us know how it goes?
> >
> > Cheers, Dan
> >
> >
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
> >>
> >> Hi again Dan!
> >> Things are improving, all OSDs are up, but still that one PG is down.
> >> More info below.
> >>
> >> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>>>> Here is the output of "pg 85.12 query":
> >>>>>> https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>>>> and its status (also showing the other 85.XX, for reference):
> >>>>>
> >>>>> This is very weird:
> >>>>>
> >>>>>      "up": [
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647,
> >>>>>          2147483647
> >>>>>      ],
> >>>>>      "acting": [
> >>>>>          67,
> >>>>>          91,
> >>>>>          82,
> >>>>>          2147483647,
> >>>>>          112
> >>>>>      ],
> >>
> >> Meanwhile, since a random PG still shows output like the one above, I
> >> think I found the problem with the crush rule: it says "choose" rather
> >> than "chooseleaf"!
> >>
> >> rule csd-data-pool {
> >>     id 5
> >>     type erasure
> >>     min_size 3
> >>     max_size 5
> >>     step set_chooseleaf_tries 5
> >>     step set_choose_tries 100
> >>     step take default class big
> >>     step choose indep 0 type host      <--- HERE!
> >>     step emit
> >> }
> >>
> >> ...a relic of a more complicated, two-step rule... sigh!
> >>
> >>> PGs are active if at least 3 shards are up.
> >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> >>> assuming 85.25 remains the one and only PG which is down?)
> >>
> >> Yes, 85.25 is still the single 'down' PG.
> >>
> >>>> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >>>> last_change 616460 flags
> >>>> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >>>> application rbd
> >>>
> >>> Yup okay, we need to fix that later to make this cluster correctly
> >>> configured. To be followed up.
> >>
> >> At some point, we need to update min_size to 4.
> >>
> >>>> If I understand correctly, it should now be safe (but I will wait for
> >>>> your green light) to repeat the same for:
> >>>>     osd.121   chunk 85.11s0
> >>>>     osd.145   chunk 85.33s0
> >>>> so they can also start.
> >>>
> >>> Yes, please go ahead and do the same.
> >>> I expect that your PG 85.25 will go active as soon as both those OSDs
> >>> start correctly.
> >>
> >> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> >> Its chunks are in:
> >>
> >>     85.25s0: osd.64
> >>     85.25s1: osd.140  osd.159
> >>     85.25s2: osd.96
> >>     85.25s3: osd.121  osd.176
> >>     85.25s4: osd.159  osd.56
> >>
> >>> BTW, I also noticed in your crush map below that the down OSDs have
> >>> crush weight zero!
> >>> So -- this means they are the only active OSDs for a PG, and they are
> >>> all set to be drained.
> >>> How did this happen? It is also surely part of the root cause here!
> >>>
> >>> I suggest resetting the crush weight of those back to what it was
> >>> before, probably 1?
> >>
> >> At some point I changed those weights to 0, but this was well after the
> >> beginning of the problem: it helped, at least, to heal a lot of the
> >> degraded/undersized PGs.
> >>
> >>> After you have all the PGs active, we need to find out why their "up"
> >>> set is completely bogus.
> >>> This is evidence that your crush rule is broken.
> >>> If a PG doesn't have a complete "up" set, then it can never stop being
> >>> degraded -- the PGs don't know where to go.
> >>
> >> Do you think the choose/chooseleaf issue mentioned above could be the
> >> culprit?
> >>
> >>> I'm curious about that "storage" type you guys invented.
> >>
> >> Oh, nothing too fancy... a foreword: we happen to be using (and are
> >> currently finally replacing) hardware (based on FiberChannel-SAN) which
> >> is not the first choice in the Ceph world, but the purchase happened before
> >> we turned to Ceph as our storage solution. Each OSD server has access to
> >> 2 such distinct storage systems, hence the idea to describe these
> >> failure domains in the crush rule.
> >>
> >>> Could you please copy to pastebin and share the crush.txt from
> >>>
> >>>      ceph osd getcrushmap -o crush.map
> >>>      crushtool -d crush.map -o crush.txt
> >>
> >> Here it is:
> >>      https://pastebin.ubuntu.com/p/THkcT6xNgC/
> >>
> >>>> Sure! Here it is. For historical reasons there are buckets of type
> >>>> "storage" which however you can safely ignore as they are no longer
> >>>> present in any crush_rule.
> >>>
> >>> I think they may be relevant, as mentioned earlier.
> >>>
> >>>> Please also don't worry about the funny weights, as I am preparing for
> >>>> hardware replacement and am freeing up space.
> >>>
> >>> As a general rule, never drain OSDs (never decrease their crush
> >>> weight) when any PG is degraded.
> >>> You risk deleting the last copy of a PG!
> >>
> >> --
> >> Fulvio Galeazzi
> >> GARR-CSD Department
> >> tel.:   +39-334-6533-250
> >> skype:  fgaleazzi70
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> tel.:   +39-334-6533-250
> skype:  fgaleazzi70

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
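
For reference, the fix agreed on in this thread boils down to a short
sequence of commands. The following is only a sketch: it assumes the rule
and pool names quoted above (csd-data-pool, csd-dataonly-ec-pool), uses
osd.<id> as a placeholder for whichever OSDs had their crush weight set to
0, and 1.0 as an example weight; adjust both to match the actual cluster.

     # Dump and decompile the current crush map
     ceph osd getcrushmap -o crush.map
     crushtool -d crush.map -o crush.txt

     # In crush.txt, inside rule csd-data-pool, change
     #     step choose indep 0 type host
     # to
     #     step chooseleaf indep 0 type host

     # Recompile and inject the fixed map
     crushtool -c crush.txt -o crush.new
     ceph osd setcrushmap -i crush.new

     # Restore the crush weight of any OSD that was drained to 0 while
     # still holding needed shards (osd.<id> and 1.0 are placeholders)
     ceph osd crush reweight osd.<id> 1.0

     # Once everything is back to active+clean, raise min_size to k+1
     ceph osd pool set csd-dataonly-ec-pool min_size 4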