Node reboot -- OSDs not "logging off" from cluster

Daniel Schneller <daniel.schneller@xxxxxxxxxxxxxxxx> · Tue, 30 Jun 2015 11:36:21 +0200

Hi!

We are seeing strange and problematic behavior in our Ceph 0.94.1
cluster on Ubuntu 14.04.1. The cluster has 5 nodes with 4 OSDs each.

When rebooting one of the nodes (e.g. for a kernel upgrade), the OSDs
do not seem to shut down correctly. Clients hang, and ceph osd tree shows
the OSDs of that node as still up. Repeated runs of ceph osd tree show
them going down only after a while. For instance, here osd.7 is still up,
even though the machine is in the middle of the reboot cycle.

[C|root@control01]  ~ ➜  ceph osd tree
# id	weight	type name	up/down	reweight
-1	36.2	root default
-2	7.24		host node01
0	1.81			osd.0	up	1	
5	1.81			osd.5	up	1	
10	1.81			osd.10	up	1	
15	1.81			osd.15	up	1	
-3	7.24		host node02
1	1.81			osd.1	up	1	
6	1.81			osd.6	up	1	
11	1.81			osd.11	up	1	
16	1.81			osd.16	up	1	
-4	7.24		host node03
2	1.81			osd.2	down	1	
7	1.81			osd.7	up	1	
12	1.81			osd.12	down	1	
17	1.81			osd.17	down	1	
-5	7.24		host node04
3	1.81			osd.3	up	1	
8	1.81			osd.8	up	1	
13	1.81			osd.13	up	1	
18	1.81			osd.18	up	1	
-6	7.24		host node05
4	1.81			osd.4	up	1	
9	1.81			osd.9	up	1	
14	1.81			osd.14	up	1	
19	1.81			osd.19	up	1

So it seems the services are either not shut down correctly when the
reboot begins, or they do not get enough time to actually let the
cluster know they are going away.
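
To see more easily which OSDs of a rebooting node are still reported as up, the tree output can be filtered per host. This is only a rough sketch: the function name osds_up_on_host is made up for illustration, and it assumes the column layout shown above (id, weight, type or name, up/down) with bucket IDs being negative.

```shell
#!/bin/sh
# Hypothetical helper: print the OSDs under one host that the tree
# output still reports as "up". Reads `ceph osd tree` text on stdin.
# Assumes the column layout from the listing above.
osds_up_on_host() {
    awk -v host="$1" '
        # Bucket lines (root, host, ...) have a negative id in column 1.
        $1 ~ /^-/ { in_host = ($3 == "host" && $4 == host); next }
        # OSD lines: column 3 is the name, column 4 the up/down state.
        in_host && $3 ~ /^osd\./ && $4 == "up" { print $3 }
    '
}
```

Running `ceph osd tree | osds_up_on_host node03` repeatedly during the reboot would then show how long each OSD stays marked up.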

If I stop the OSDs on that node manually before the reboot, everything
works as expected and clients don't notice any interruptions.

[C|root@node03]  ~ ➜  service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  reboot
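
The manual stops above could be wrapped in a small loop so they do not have to be typed per OSD. This is a sketch under assumptions, not something we have deployed: the function name stop_local_osds and the DRY_RUN switch are made up for illustration, and it assumes the OSD data directories follow the default /var/lib/ceph/osd/ceph-<id> layout with the cluster name "ceph".

```shell
#!/bin/sh
# Hypothetical helper: stop every OSD that has a data directory on this
# node, using the same "service ceph-osd stop id=N" call as above.
# Set DRY_RUN=1 to only print the commands instead of running them.
stop_local_osds() {
    osd_root=${1:-/var/lib/ceph/osd}   # assumed default data directory
    for dir in "$osd_root"/ceph-*; do
        [ -d "$dir" ] || continue      # skip if the glob matched nothing
        id=${dir##*-}                  # ".../ceph-12" -> "12"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "service ceph-osd stop id=$id"
        else
            service ceph-osd stop id="$id"
        fi
    done
}
```

Calling `DRY_RUN=1 stop_local_osds` first prints what would be stopped; `stop_local_osds && reboot` would then replace the sequence above.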

The upstart file was not changed from the packaged version.
Interestingly, the same Ceph version on a different cluster does _not_
show this behavior.

Any ideas what is causing this, or how to diagnose it?

Cheers,
Daniel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com