Thanks for the comments, I'll check the log files to see if there's any
hint. Getting the PGs into an active state is one thing; I'm sure
multiple approaches would have worked. The main question is why this
happens: we have 19 hosts to rebuild and can't risk an application
outage every time.
Was the PG stuck in the "activating" state? If so, I wonder if you
temporarily exceeded mon_max_pg_per_osd on some OSDs when rebuilding
your host. At least on Nautilus I've seen cases where Ceph doesn't
gracefully recover from this temporary limit violation and the PGs
need some nudges to become active.
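To rule that out, you could compare the per-OSD PG counts against the
configured limit. A rough sketch (Nautilus-era commands, run against the
live cluster):

```shell
# Show PG counts per OSD (see the PGS column); during a host rebuild
# the remaining OSDs may temporarily carry the rebuilt host's PGs too.
ceph osd df

# Show the limit the mons enforce before refusing PG activation.
ceph config get mon mon_max_pg_per_osd
```

If the PGS column ever spikes above that limit while the host is
backfilling, the theory above would fit.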
I'm pretty sure their cluster isn't anywhere near the limit for
mon_max_pg_per_osd; they currently have around 100 PGs per OSD, and the
configs have not been touched, it's a pretty basic setup. This cluster
was upgraded from Luminous to Nautilus a few months ago.
Zitat von Anthony D'Atri <anthony.datri@xxxxxxxxx>:
Something worth a try before restarting an OSD in situations like this:
ceph osd down 9
This marks the OSD down in the osdmap, but doesn’t touch the daemon.
Typically the subject OSD will see this and tell the mons “I’m not
dead yet!” and repeer, which sometimes suffices to clear glitches.
Then I restarted OSD.9 and the inactive PG became active again.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx