Re: Cannot shutdown monitors

Michael Andersen <michael@xxxxxxxxxxxxx> · Fri, 10 Feb 2017 21:31:49 -0800

I definitely had all the rbd volumes unmounted. I am not sure if they were unmapped. I can try that.
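
A minimal sketch of the "unmap them to be sure" step, assuming the kernel rbd driver lists each mapped image as a numbered entry under /sys/bus/rbd/devices and that entry N corresponds to /dev/rbdN. The names RBD_SYSFS and DRY_RUN are hypothetical knobs added for illustration; they are not ceph options. `rbd showmapped` should give the same information if the rbd CLI is preferred.

```shell
#!/bin/sh
# Sketch only. RBD_SYSFS and DRY_RUN are illustrative knobs, not ceph options.

# Print one /dev/rbdN path per kernel-mapped image, based on sysfs entries.
list_mapped_rbd() {
    sysfs_dir="${RBD_SYSFS:-/sys/bus/rbd/devices}"
    for entry in "$sysfs_dir"/*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || continue
        echo "/dev/rbd${entry##*/}"
    done
}

# Unmap every device reported above; with DRY_RUN set, only print the commands.
unmap_all_rbd() {
    for dev in $(list_mapped_rbd); do
        ${DRY_RUN:+echo} rbd unmap "$dev"
    done
}
```

Running `DRY_RUN=1 unmap_all_rbd` first shows what would be unmapped without touching anything; no output means nothing is mapped on that host.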

On Fri, Feb 10, 2017 at 9:10 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Sat, Feb 11, 2017 at 2:58 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

> Just making sure the list sees this for those that are following.

>

> On Sat, Feb 11, 2017 at 2:49 PM, Michael Andersen <michael@xxxxxxxxxxxxx> wrote:

>> Right, so yes libceph is loaded

>>

>> root@compound-7:~# lsmod | egrep "ceph|rbd"

>> rbd                    69632  0

>> libceph               245760  1 rbd

>> libcrc32c              16384  3 xfs,raid456,libceph

>>

>> I stopped all the services and unloaded the modules

>>

>> root@compound-7:~# systemctl stop ceph\*.service ceph\*.target

>> root@compound-7:~# modprobe -r rbd

>> root@compound-7:~# modprobe -r libceph

>> root@compound-7:~# lsmod | egrep "ceph|rbd"

>>

>> Then rebooted

>> root@compound-7:~# reboot

>>

>> And sure enough the reboot happened OK.

>>

>> So that solves my immediate problem, I now know how to work around it

>> (thanks!), but I would love to work out how to not need this step. Any

>> further info I can give to help?

Can you double-check that all rbd volumes are unmounted on this host

when shutting down? Maybe unmap them just for good measure.

I don't believe the libceph module should need to talk to the cluster

unless it has active connections at the time of shutdown.
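
The workaround quoted above (stop the ceph units, unload rbd and libceph, then reboot) could be wrapped in a small helper along these lines. This is a sketch under stated assumptions, not a tested fix: DRY_RUN is a hypothetical switch added for illustration, and the module list (rbd, ceph, libceph) assumes only the standard kernel client modules are involved. It re-checks lsmod before each unload because removing rbd may already have removed libceph.

```shell
#!/bin/sh
# Sketch only. DRY_RUN is an illustrative switch, not a modprobe or systemd option.

# Filter lsmod-style input on stdin down to the ceph kernel client modules.
ceph_modules_loaded() {
    awk '$1 == "rbd" || $1 == "ceph" || $1 == "libceph" { print $1 }'
}

# Unload whichever of those modules are present. rbd and ceph (the CephFS
# client) both depend on libceph, so libceph goes last.
unload_ceph_modules() {
    for mod in rbd ceph libceph; do
        # Re-read lsmod each time: unloading rbd may have dropped libceph too.
        if lsmod | ceph_modules_loaded | grep -qx "$mod"; then
            ${DRY_RUN:+echo} modprobe -r "$mod"
        fi
    done
}

# Stop the ceph units the same way the thread does, then drop the modules.
pre_reboot_cleanup() {
    ${DRY_RUN:+echo} systemctl stop 'ceph*.service' 'ceph*.target'
    unload_ceph_modules
}
```

With `DRY_RUN=1 pre_reboot_cleanup` the commands are only printed, which makes it easy to confirm what would run before doing it as root ahead of a reboot.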

>>

>>

>>

>> On Fri, Feb 10, 2017 at 8:42 PM, Michael Andersen <michael@xxxxxxxxxxxxx>

>> wrote:

>>>

>>> Sorry this email arrived out of order. I will do the modprobe -r test

>>>

>>> On Fri, Feb 10, 2017 at 8:20 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

>>>>

>>>> On Sat, Feb 11, 2017 at 2:08 PM, Michael Andersen <michael@xxxxxxxxxxxxx>

>>>> wrote:

>>>> > I believe I did shut down the mon process. Is that not done by the

>>>> >

>>>> > sudo systemctl stop ceph\*.service ceph\*.target

>>>> >

>>>> > command? Also, as I noted, the mon process does not show up in ps after

>>>> > I do that, but I still get the shutdown halting.

>>>> >

>>>> > The libceph kernel module may be installed. I did not do so deliberately,

>>>> > but I used ceph-deploy, so if it installs that then that is why it's

>>>> > there. I also run some kubernetes pods with rbd persistent volumes on

>>>> > these machines, although no rbd volumes are in use or mounted when I try

>>>> > to shut down. In fact I unmapped all rbd volumes across the whole cluster

>>>> > to make sure. Is libceph required for rbd?

>>>>

>>>> For kernel rbd (/dev/rbd0, etc.) yes, for librbd, no.

>>>>

>>>> As a test try modprobe -r on both the libceph and rbd modules before

>>>> shutdown and see if that helps ("modprobe -r rbd" should unload

>>>> libceph as well but verify that).

>>>>

>>>> >

>>>> > But even so, is it normal for the libceph kernel module to prevent

>>>> > shutdown? Is there another stage in the shutdown procedure that I am

>>>> > missing?

>>>> >

>>>> >

>>>> > On Feb 10, 2017 7:49 PM, "Brad Hubbard" <bhubbard@xxxxxxxxxx> wrote:

>>>> >

>>>> > That looks like dmesg output from the libceph kernel module. Do you

>>>> > have the libceph kernel module loaded?

>>>> >

>>>> > If the answer to that question is "yes" the follow-up question is

>>>> > "Why?" as it is not required for a MON or OSD host.

>>>> >

>>>> > On Sat, Feb 11, 2017 at 1:18 PM, Michael Andersen <michael@xxxxxxxxxxxxx> wrote:

>>>> >> Yeah, all three mons have OSDs on the same machines.

>>>> >>

>>>> >> On Feb 10, 2017 7:13 PM, "Shinobu Kinjo" <skinjo@xxxxxxxxxx> wrote:

>>>> >>>

>>>> >>> Is your primary MON running on a host that also runs some OSDs?

>>>> >>>

>>>> >>> On Sat, Feb 11, 2017 at 11:53 AM, Michael Andersen

>>>> >>> <michael@xxxxxxxxxxxxx> wrote:

>>>> >>> > Hi

>>>> >>> >

>>>> >>> > I am running a small cluster of 8 machines (80 osds), with three

>>>> >>> > monitors on Ubuntu 16.04. Ceph version 10.2.5.

>>>> >>> >

>>>> >>> > I cannot reboot the monitors without physically going into the

>>>> >>> > datacenter and power cycling them. What happens is that while shutting

>>>> >>> > down, ceph gets stuck trying to contact the other monitors but

>>>> >>> > networking has already shut down or something like that. I get an

>>>> >>> > endless stream of:

>>>> >>> >

>>>> >>> > libceph: connect 10.20.0.10:6789 error -101

>>>> >>> > libceph: connect 10.20.0.13:6789 error -101

>>>> >>> > libceph: connect 10.20.0.17:6789 error -101

>>>> >>> >

>>>> >>> > where in this case 10.20.0.10 is the machine I am trying to shut down

>>>> >>> > and all three IPs are the MONs.

>>>> >>> >

>>>> >>> > At this stage of the shutdown, the machine doesn't respond to pings,

>>>> >>> > and I cannot even log in on any of the virtual terminals. Nothing to do

>>>> >>> > but power off at the server.

>>>> >>> >

>>>> >>> > The other non-mon servers shut down just fine, and the cluster was

>>>> >>> > healthy at the time I was rebooting the mon (I only reboot one machine

>>>> >>> > at a time, waiting for it to come up before I do the next one).

>>>> >>> >

>>>> >>> > Also worth mentioning that if I execute

>>>> >>> >

>>>> >>> > sudo systemctl stop ceph\*.service ceph\*.target

>>>> >>> >

>>>> >>> > on the server, the only ceph-related processes I still see in ps are:

>>>> >>> >

>>>> >>> > root     11143     2  0 18:40 ?        00:00:00 [ceph-msgr]

>>>> >>> > root     11162     2  0 18:40 ?        00:00:00 [ceph-watch-noti]

>>>> >>> >

>>>> >>> > and even then, when no ceph daemons are left running, doing a reboot

>>>> >>> > goes into the same loop.

>>>> >>> >

>>>> >>> > I can't really find any mention of this online, but I feel someone must

>>>> >>> > have hit this. Any idea how to fix it? It's really annoying because it's

>>>> >>> > hard for me to get access to the datacenter.

>>>> >>> >

>>>> >>> > Thanks

>>>> >>> > Michael

>>>> >>> >

>>>> >>> > _______________________________________________

>>>> >>> > ceph-users mailing list

>>>> >>> > ceph-users@xxxxxxxxxxxxxx

>>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>> >>> >


--

Cheers,

Brad
