Another common config to work around this pg num limit is:

  ceph config set osd osd_max_pg_per_osd_hard_ratio 10

(Then possibly the repeer step on each activating PG.)

.. Dan

On Thu, Sept 15, 2022, 17:47 Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:

> Hi Fulvio,
>
> I've seen this in the past when a CRUSH change temporarily resulted in
> too many PGs being mapped to an OSD, exceeding mon_max_pg_per_osd. You
> can try increasing that setting to see if it helps, then setting it
> back to default once backfill completes. You may also need to "ceph pg
> repeer $pgid" for each of the PGs stuck activating.
>
> Josh
>
> On Thu, Sep 15, 2022 at 8:42 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx>
> wrote:
> >
> > Hello,
> >    I am on Nautilus and today, after upgrading the operating system
> > (from CentOS 7 to CentOS 8 Stream) on a couple of OSD servers and adding
> > them back to the cluster, I noticed some PGs are still "activating".
> >    The upgraded servers are from the same "rack", and I have replica-3
> > pools with a 1-per-rack rule, and 6+4 EC pools (in some cases with an
> > SSD pool for metadata).
> >
> > More details:
> > - on the two OSD servers I upgraded, I ran "systemctl stop ceph.target"
> >   and waited a while, to verify all PGs would remain "active"
> > - went on with the upgrade and ceph-ansible reconfig
> > - as soon as I started adding OSDs, I saw "slow ops"
> > - to exclude possible effects of the updated packages, I ran "yum update"
> >   on all OSD servers, and rebooted them one by one
> > - after 2-3 hours, the last OSD disks finally came up
> > - I am left with:
> >      about 1k "slow ops" (if I pause recovery, the number is ~stable but
> >        the max age keeps increasing)
> >      ~200 inactive PGs
> >
> > Most of the inactive PGs are from the object store pool:
> >
> > [cephmgr@cephAdmCT1.cephAdmCT1 ~]$ ceph osd pool get default.rgw.buckets.data crush_rule
> > crush_rule: default.rgw.buckets.data
> >
> > rule default.rgw.buckets.data {
> >         id 6
> >         type erasure
> >         min_size 3
> >         max_size 10
> >         step set_chooseleaf_tries 5
> >         step set_choose_tries 100
> >         step take default class big
> >         step chooseleaf indep 0 type host
> >         step emit
> > }
> >
> > But "ceph pg dump_stuck inactive" also shows 4 lines for the glance
> > replicated pool, like:
> >
> > 82.34  activating+remapped                      [139,50,207]  139  [139,50,284]  139
> > 82.54  activating+undersized+degraded+remapped  [139,86,5]    139  [139,74]      139
> >
> > Need your help, please:
> >
> > - any idea what the root cause of all this was?
> >
> > - and now, how can I help the OSDs complete their activation?
> >   + does the procedure differ for EC or replicated pools, by the way?
> >   + or maybe I should first get rid of the "slow ops" issue?
> >
> > I am pasting:
> >    ceph osd df tree
> >      https://pastebin.ubuntu.com/p/VWhT7FWf6m/
> >
> >    ceph osd lspools ; ceph pg dump_stuck inactive
> >      https://pastebin.ubuntu.com/p/9f6rXRYMh4/
> >
> > Thanks a lot!
> >
> >                    Fulvio
> >
> > --
> > Fulvio Galeazzi
> > GARR-CSD Department
> > tel.: +39-334-6533-250
> > skype: fgaleazzi70
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
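
For readers hitting the same symptom, a minimal sketch of the workaround Dan and
Josh describe might look like the commands below. It assumes a Nautilus-or-later
cluster using centralized config ("ceph config"); the limit values are purely
illustrative (not recommendations), and the awk-based parsing of the
"ceph pg dump_stuck" table is an assumption about its plain-text output, so
check it against your cluster before running anything.

  # Temporarily raise the per-OSD PG limits so the remapped PGs can peer.
  # The values below are examples only.
  ceph config set global mon_max_pg_per_osd 500
  ceph config set osd osd_max_pg_per_osd_hard_ratio 10

  # Re-peer every PG that is still stuck inactive (e.g. "activating").
  # NR>1 skips the header line of the dump_stuck table; "ok" goes to stderr.
  for pgid in $(ceph pg dump_stuck inactive 2>/dev/null | awk 'NR>1 {print $1}'); do
      ceph pg repeer "$pgid"
  done

  # Once backfill has finished and all PGs are active+clean,
  # drop the overrides so the defaults apply again.
  ceph config rm global mon_max_pg_per_osd
  ceph config rm osd osd_max_pg_per_osd_hard_ratio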