On Thu, Apr 7, 2022 at 12:15 AM Eugen Block <eblock@xxxxxx> wrote:
> Basically, these are the steps to remove all OSDs from that host (OSDs
> are not "replaced" so they aren't marked "destroyed") [1]:
>
> 1) Call 'ceph osd out $id'
> 2) Call systemctl stop ceph-osd@$id
> 3) ceph osd purge $id --yes-i-really-mean-it

Ah, this purge step would still be enough to potentially cause the
problem. To be clear, then, this is the sequence I'm proposing could
have caused it:

1. All OSDs on the host are purged, per the steps above.
2. New OSDs are created.
3. As the new OSDs come up, one by one, CRUSH starts to assign PGs to
   them. Importantly, when the first OSD comes up, it gets a large
   number of PGs, exceeding mon_max_pg_per_osd. As a result, some of
   these PGs don't activate.
4. As each of the remaining OSDs comes up, CRUSH reassigns some PGs
   to it.
5. Finally, all OSDs are up. However, any PGs that were stuck in
   "activating" from step 3 and were _not_ reassigned to other OSDs
   remain stuck in "activating", and need a repeer or an OSD down/up
   cycle to restart peering. (At least in Pacific, raising
   mon_max_pg_per_osd also allows some of these PGs to make peering
   progress.)

This assumes that the CRUSH rule in question leads to this sort of
behaviour. I would expect it more from a host-centric CRUSH rule than
from a rack-centric one, for example.

Also note that it's a matter of "luck" whether you'll actually see a
problem at step 5: if it so happens that all of the PGs stuck in
"activating" in step 3 get reassigned, the final state of the cluster
will be fine (i.e. no inactive PGs).

Josh
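
P.S. In case it helps, this is roughly what I'd run to confirm and
clear PGs stuck as described in step 5. It's only a sketch: substitute
your own PG and OSD IDs, and the 400 below is just an example value
for the limit, not a recommendation.

    # list PGs currently stuck in "activating"
    ceph pg ls activating

    # kick peering for a specific stuck PG
    ceph pg repeer <pgid>

    # or mark the acting primary OSD down so it re-peers on its own
    ceph osd down <osd-id>

    # or temporarily raise the per-OSD PG limit (revert it afterwards)
    ceph config set global mon_max_pg_per_osd 400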