Hi
I am running a small cluster of 8 machines (80 OSDs) with three monitors on Ubuntu 16.04, Ceph version 10.2.5.
I cannot reboot the monitors without physically going into the datacenter and power cycling them. During shutdown, Ceph gets stuck trying to contact the monitors, apparently after networking has already been brought down. I get an endless stream of:
libceph: connect 10.20.0.10:6789 error -101
libceph: connect 10.20.0.13:6789 error -101
libceph: connect 10.20.0.17:6789 error -101
Here 10.20.0.10 is the machine I am trying to shut down, and all three IPs are the MONs.
At this stage of the shutdown the machine doesn't respond to pings, and I cannot log in on any of the virtual terminals. The only option left is to power the server off physically.
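For reference, errno 101 on Linux (x86 and the generic errno table) is ENETUNREACH, "Network is unreachable", which fits the guess that networking is already down when these messages appear. The sketch below tallies such lines per monitor address and error code. It is only an illustration: the function name `summarise_libceph_errors` and the file name `console.log` are hypothetical, and it assumes the console output has been saved to a file somewhere.

```shell
#!/bin/sh
# Summarise "libceph: connect ADDR:PORT error -N" lines read from stdin.
# Prints one line per distinct address and error code:
#   <count> <addr:port> <errno number>
# Lines that do not match the pattern are ignored.
summarise_libceph_errors() {
    sed -n 's/.*libceph: connect \([0-9.]*:[0-9]*\) error -\([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Example usage (console.log is a hypothetical capture of the console output):
#   summarise_libceph_errors < console.log
```

If every monitor address shows the same error number, the problem is more likely on the local network stack than on any single monitor.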
The other non-mon servers shut down just fine, and the cluster was healthy at the time I was rebooting the mon (I only reboot one machine at a time, waiting for it to come up before I do the next one).
It is also worth mentioning that if I run
sudo systemctl stop ceph\*.service ceph\*.target
on the server, the only Ceph-related entries left in the process list are:
root 11143 2 0 18:40 ? 00:00:00 [ceph-msgr]
root 11162 2 0 18:40 ? 00:00:00 [ceph-watch-noti]
and even then, with no Ceph daemons left running, a reboot goes into the same loop.
I can't really find any mention of this online, but I feel someone must have hit this. Any idea how to fix it? It's really annoying because it's hard for me to get access to the datacenter.
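A note on the two entries above: names in square brackets with parent PID 2 are kernel threads, not userspace daemons, so stopping the systemd units does not affect them. The "libceph:" prefix in the error messages is also the log prefix of the Ceph kernel client module. One possible thing to check, then, is whether the monitor host itself uses the kernel client, for example through a mapped RBD image or a kernel CephFS mount. The sketch below is a minimal way to look for that, assuming a Linux host with /proc available; the function names are hypothetical.

```shell
#!/bin/sh
# List loaded Ceph kernel client modules.
# Expects /proc/modules-style input on stdin (module name in column 1).
ceph_kernel_modules() {
    awk '$1 == "libceph" || $1 == "rbd" || $1 == "ceph" { print $1 }'
}

# List mount points of kernel CephFS mounts.
# Expects /proc/mounts-style input on stdin (mount point in column 2,
# filesystem type in column 3). ceph-fuse mounts have a different type
# and are deliberately not matched.
cephfs_kernel_mounts() {
    awk '$3 == "ceph" { print $2 }'
}

# Example usage on the affected host:
#   ceph_kernel_modules < /proc/modules
#   cephfs_kernel_mounts < /proc/mounts
```

Mapped RBD images can be listed separately with `rbd showmapped`. If any of these turn up something, unmounting or unmapping it before the reboot would be a reasonable experiment, though whether that resolves the hang here is untested.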
Thanks
Michael