Cannot shutdown monitors

Michael Andersen <michael@xxxxxxxxxxxxx> · Fri, 10 Feb 2017 18:53:45 -0800

Hi,
I am running a small cluster of 8 machines (80 OSDs) with three monitors on Ubuntu 16.04, Ceph version 10.2.5.

I cannot reboot the monitors without physically going into the datacenter and power cycling them. During shutdown, Ceph gets stuck trying to contact the monitors, apparently after networking has already been brought down. I get an endless stream of:

libceph: connect 10.20.0.10:6789 error -101
libceph: connect 10.20.0.13:6789 error -101
libceph: connect 10.20.0.17:6789 error -101

where 10.20.0.10 is the machine I am trying to shut down, and all three IPs are the MONs.
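
For reference, the -101 in these messages is a negated Linux errno value. 101 is ENETUNREACH ("Network is unreachable"), which is consistent with the network already being down when the connection attempts are made. A minimal Python sketch for decoding such lines follows; the small errno table and the `decode` helper are illustrative assumptions, not part of Ceph.

```python
import re

# Sample lines copied from the console output above.
LOG_LINES = [
    "libceph: connect 10.20.0.10:6789 error -101",
    "libceph: connect 10.20.0.13:6789 error -101",
    "libceph: connect 10.20.0.17:6789 error -101",
]

# A few Linux errno values relevant to connection failures.
# Hard-coded so the result does not depend on the platform running the script.
LINUX_ERRNO = {
    101: ("ENETUNREACH", "Network is unreachable"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

LINE_RE = re.compile(r"libceph: connect (\S+):(\d+) error -(\d+)")


def decode(line):
    """Return (address, port, errno name, description), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    addr, port, code = m.group(1), int(m.group(2)), int(m.group(3))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in the table above"))
    return addr, port, name, text


for line in LOG_LINES:
    print(decode(line))
```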

At this stage of the shutdown, the machine doesn't respond to pings, and I cannot even log in on any of the virtual terminals. There is nothing to do but power it off at the server.

The other non-mon servers shut down just fine, and the cluster was healthy at the time I was rebooting the mon (I only reboot one machine at a time, waiting for it to come up before I do the next one).

It is also worth mentioning that if I execute

sudo systemctl stop ceph\*.service ceph\*.target

on the server, the only Ceph-related entries left in the process list are:

root     11143     2  0 18:40 ?        00:00:00 [ceph-msgr]
root     11162     2  0 18:40 ?        00:00:00 [ceph-watch-noti]

and even then, with no Ceph daemons left running, a reboot goes into the same loop.
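
A possible reading of this listing: both entries have parent PID 2 and a bracketed name, which is how `ps` on Linux normally displays kernel threads rather than userspace daemons. That would explain why `systemctl stop` leaves them in place, although it is only an inference from the output shown. A minimal Python sketch of that check follows; the helper names `parse_ps_line` and `is_kernel_thread` are hypothetical, and the column layout is assumed to be that of `ps -ef`.

```python
# Sample `ps -ef`-style lines copied from the message above.
PS_LINES = [
    "root     11143     2  0 18:40 ?        00:00:00 [ceph-msgr]",
    "root     11162     2  0 18:40 ?        00:00:00 [ceph-watch-noti]",
]

KTHREADD_PID = 2  # on Linux, kernel threads are children of kthreadd (PID 2)


def parse_ps_line(line):
    """Split one `ps -ef` line into (pid, ppid, command)."""
    # Assumed columns: UID PID PPID C STIME TTY TIME CMD; CMD may contain spaces.
    fields = line.split(None, 7)
    return int(fields[1]), int(fields[2]), fields[7]


def is_kernel_thread(line):
    """Heuristic: the parent is kthreadd and ps shows the name in brackets."""
    _pid, ppid, cmd = parse_ps_line(line)
    return ppid == KTHREADD_PID and cmd.startswith("[") and cmd.endswith("]")


for line in PS_LINES:
    pid, _ppid, cmd = parse_ps_line(line)
    kind = "kernel thread" if is_kernel_thread(line) else "userspace process"
    print(pid, cmd, kind)
```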

I can't find any mention of this online, but I feel someone must have hit this before. Any idea how to fix it? It's really annoying because it's hard for me to get access to the datacenter.

Thanks
Michael
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com