Thanks for the comments, I'll check the log files to see if there's any
hint. Getting the PGs into an active state is one thing; I'm sure
multiple approaches would have worked. The main question is why this
happens: we have 19 hosts to rebuild and can't risk an application
outage every time.
Was the PG stuck in the "activating" state? If so, I wonder if you
temporarily exceeded mon_max_pg_per_osd on some OSDs when rebuilding
your host. At least on Nautilus I've seen cases where Ceph doesn't
gracefully recover from this temporary limit violation and the PGs
need some nudges to become active.
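To rule that out, you could compare the per-OSD PG counts against the
configured limit. A rough sketch (Nautilus-era commands, run against the
live cluster):

```shell
# Show PG counts per OSD (see the PGS column); during a host rebuild
# the remaining OSDs may temporarily carry the rebuilt host's PGs too.
ceph osd df

# Show the limit the mons enforce before refusing PG activation.
ceph config get mon mon_max_pg_per_osd
```

If the PGS column ever spikes above that limit while the host is
backfilling, the theory above would fit.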
I'm pretty sure their cluster isn't anywhere near the limit for
mon_max_pg_per_osd; they currently have around 100 PGs per OSD, and the
configs have not been touched, it's a pretty basic setup. This cluster
was upgraded from Luminous to Nautilus a few months ago.
Zitat von Anthony D'Atri <anthony.datri@xxxxxxxxx>:
Something worth a try before restarting an OSD in situations like this:
ceph osd down 9
This marks the OSD down in the osdmap, but doesn’t touch the daemon.
Typically the subject OSD will see this and tell the mons “I’m not
dead yet!” and repeer, which sometimes suffices to clear glitches.
Then I restarted OSD.9 and the inactive PG became active again.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx