Re: Ceph Pacific mon is not starting after host reboot

Adam King <adking@xxxxxxxxxx> · Mon, 9 Aug 2021 14:44:33 -0400

I wanted to respond to the original thread I saw archived on this topic, but
I wasn't subscribed to the mailing list yet, so I don't have the thread in
my inbox to reply to. Hopefully those involved in that thread still see this.

This issue looks the same as https://tracker.ceph.com/issues/51027, which is
being worked on. Essentially, it seems that hosts being rebooted were
temporarily marked as offline, and cephadm had a bug where it would try to
remove all daemons (other than OSDs, I believe) from offline hosts. The
pre-remove step for a monitor is to remove it from the monmap, so that
step would run, but the daemon itself would not be removed because the
host was temporarily inaccessible during the reboot. When the host came
back up, the mon was restarted, but it had already been removed from the
monmap, so it got stuck in a "stopped" state. A fix that stops cephadm
from trying to remove daemons from offline hosts is in the works.

A temporary workaround for now, as mentioned by Harry on that tracker, is
to get cephadm to actually remove the mon daemon by changing the placement
spec so that it no longer includes the host with the broken mon. Then wait
until the mon daemon has been removed, and finally put the placement spec
back to how it was so the mon gets redeployed (and now hopefully runs
normally).
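
For anyone who wants the workaround spelled out, here is a rough sketch in
Python that only builds and prints the commands for the three steps; it does
not run anything. It assumes the mon placement is an explicit host list
(not a label or a count), and the host names host1, host2 and host3 are
made up for illustration. The --daemon_type spelling of the "ceph orch ps"
filter is also an assumption about your release, so check "ceph orch ps -h"
before relying on it.

```python
import shlex


def workaround_commands(mon_hosts, broken_host):
    """Return the three command lines for the workaround, in order.

    mon_hosts   -- hosts currently listed in the mon placement spec
    broken_host -- host whose mon is stuck after the reboot
    Nothing is executed here; the caller decides what to run.
    """
    if broken_host not in mon_hosts:
        raise ValueError("broken host is not in the mon placement")
    remaining = [h for h in mon_hosts if h != broken_host]
    if not remaining:
        raise ValueError("refusing to build a placement with no mon hosts")
    return [
        # 1. Drop the broken host from the placement so cephadm removes the mon.
        ["ceph", "orch", "apply", "mon", "--placement=" + ",".join(remaining)],
        # 2. Re-run this until the mon on the broken host is no longer listed.
        #    (--daemon_type is an assumed flag spelling; verify on your release.)
        ["ceph", "orch", "ps", "--daemon_type", "mon"],
        # 3. Restore the original placement so the mon gets redeployed.
        ["ceph", "orch", "apply", "mon", "--placement=" + ",".join(mon_hosts)],
    ]


if __name__ == "__main__":
    # Hypothetical host names; substitute the ones from your own cluster.
    for argv in workaround_commands(["host1", "host2", "host3"], "host2"):
        print(shlex.join(argv))
```

Printing the commands instead of running them is deliberate: step 2 needs a
human to confirm the daemon is really gone before step 3 is applied.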