On Thu, Apr 7, 2022 at 12:15 AM Eugen Block <eblock@xxxxxx> wrote:
> Basically, these are the steps to remove all OSDs from that host (OSDs
> are not "replaced" so they aren't marked "destroyed") [1]:
>
> 1) Call 'ceph osd out $id'
> 2) Call systemctl stop ceph-osd@$id
> 3) ceph osd purge $id --yes-i-really-mean-it

Ah, this purge step would still be enough to potentially cause the
problem. To be clear, then, this is the sequence I'm proposing could
have caused it:

1. All OSDs on the host are purged, per the steps above.
2. New OSDs are created.
3. As the new OSDs come up, one by one, CRUSH starts to assign PGs to
   them. Importantly, when the first OSD comes up, it gets a large
   number of PGs, exceeding mon_max_pg_per_osd. As a result, some of
   these PGs don't activate.
4. As each of the remaining OSDs comes up, CRUSH reassigns some PGs
   to it.
5. Finally, all OSDs are up. However, any PGs that were stuck in
   "activating" from step 3 and were _not_ reassigned to other OSDs
   remain stuck in "activating", and need a repeer or an OSD down/up
   cycle to restart peering. (At least in Pacific, raising
   mon_max_pg_per_osd also allows some of these PGs to make peering
   progress.)

This assumes that the CRUSH rule in question leads to this sort of
behaviour. I would expect it more from a host-centric CRUSH rule than
from a rack-centric one, for example.

Also note that it's a matter of "luck" whether you'll actually see a
problem at step 5: if it so happens that all of the PGs stuck in
"activating" in step 3 get reassigned, the final state of the cluster
will be fine (i.e. no inactive PGs).

Josh
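
P.S. In case it helps, this is roughly what I'd run to confirm and
clear PGs stuck as described in step 5. It's only a sketch: substitute
your own PG and OSD IDs, and the 400 below is just an example value
for the limit, not a recommendation.

    # list PGs currently stuck in "activating"
    ceph pg ls activating

    # kick peering for a specific stuck PG
    ceph pg repeer <pgid>

    # or mark the acting primary OSD down so it re-peers on its own
    ceph osd down <osd-id>

    # or temporarily raise the per-OSD PG limit (revert it afterwards)
    ceph config set global mon_max_pg_per_osd 400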