Hi Fulvio,

I've seen this in the past when a CRUSH change temporarily resulted in
too many PGs being mapped to an OSD, exceeding mon_max_pg_per_osd. You
can try increasing that setting to see if it helps, then setting it back
to the default once backfill completes. You may also need to run
"ceph pg repeer $pgid" for each of the PGs stuck activating.

Josh

On Thu, Sep 15, 2022 at 8:42 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Hallo,
>    I am on Nautilus and today, after upgrading the operating system
> (from CentOS 7 to CentOS 8 Stream) on a couple of OSD servers and
> adding them back to the cluster, I noticed some PGs are still
> "activating".
>    The upgraded servers are from the same "rack", and I have replica-3
> pools with a 1-per-rack rule, and 6+4 EC pools (in some cases, with an
> SSD pool for metadata).
>
> More details:
>  - on the two OSD servers I upgraded, I ran "systemctl stop ceph.target"
>    and waited a while, to verify all PGs would remain "active"
>  - went on with the upgrade and the ceph-ansible reconfig
>  - as soon as I started adding OSDs, I saw "slow ops"
>  - to exclude possible effects of the updated packages, I ran
>    "yum update" on all OSD servers and rebooted them one by one
>  - after 2-3 hours, the last OSD disks finally came up
>  - I am left with:
>      about 1k "slow ops" (if I pause recovery, the number stays roughly
>      stable but the max age keeps increasing)
>      ~200 inactive PGs
>
> Most of the inactive PGs are from the object store pool:
>
> [cephmgr@cephAdmCT1.cephAdmCT1 ~]$ ceph osd pool get default.rgw.buckets.data crush_rule
> crush_rule: default.rgw.buckets.data
>
> rule default.rgw.buckets.data {
>         id 6
>         type erasure
>         min_size 3
>         max_size 10
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class big
>         step chooseleaf indep 0 type host
>         step emit
> }
>
> But "ceph pg dump_stuck inactive" also shows 4 lines for the glance
> replicated pool, like:
>
> 82.34  activating+remapped                      [139,50,207]  139  [139,50,284]  139
> 82.54  activating+undersized+degraded+remapped  [139,86,5]    139  [139,74]      139
>
> Need your help, please:
>
> - any idea what the root cause of all this was?
>
> - and now, how can I help the OSDs complete their activation?
>   + does the procedure differ for EC or replicated pools, by the way?
>   + or maybe I should first get rid of the "slow ops" issue?
>
> I am pasting:
>    ceph osd df tree
>       https://pastebin.ubuntu.com/p/VWhT7FWf6m/
>
>    ceph osd lspools ; ceph pg dump_stuck inactive
>       https://pastebin.ubuntu.com/p/9f6rXRYMh4/
>
> Thanks a lot!
>
>                 Fulvio
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> tel.: +39-334-6533-250
> skype: fgaleazzi70
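
Concretely, the sequence I have in mind would look something like the
following (untested, and the 400 below is only an example value; pick
something comfortably above your current PGs-per-OSD count):

   # temporarily raise the per-OSD PG limit cluster-wide
   ceph config set global mon_max_pg_per_osd 400

   # re-peer each stuck PG; the pgid is the first column of
   # "ceph pg dump_stuck inactive", e.g. 82.34 from your output
   ceph pg dump_stuck inactive 2>/dev/null | awk '/^[0-9]+\./ {print $1}' |
   while read pg; do
       ceph pg repeer "$pg"
   done

   # once backfill has completed, drop the override again
   ceph config rm global mon_max_pg_per_osd

The awk filter is just grabbing the pgid column from the plain-text
output, so do sanity-check it against your own "ceph pg dump_stuck"
output before running the loop.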