> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Gregory Farnum
> Sent: 01 July 2015 16:56
> To: Daniel Schneller
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Node reboot -- OSDs not "logging off" from cluster
>
> On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller
> <daniel.schneller@xxxxxxxxxxxxxxxx> wrote:
> > Hi!
> >
> > We are seeing a strange - and problematic - behavior in our 0.94.1
> > cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.
> >
> > When rebooting one of the nodes (e.g. for a kernel upgrade) the OSDs
> > do not seem to shut down correctly. Clients hang and ceph osd tree
> > shows the OSDs of that node still up. Repeated runs of ceph osd tree
> > show them going down after a while. For instance, here OSD.7 is still
> > up, even though the machine is in the middle of the reboot cycle.
> >
> > [C|root@control01] ~ ➜ ceph osd tree
> > # id    weight  type name       up/down reweight
> > -1      36.2    root default
> > -2      7.24            host node01
> > 0       1.81                    osd.0   up      1
> > 5       1.81                    osd.5   up      1
> > 10      1.81                    osd.10  up      1
> > 15      1.81                    osd.15  up      1
> > -3      7.24            host node02
> > 1       1.81                    osd.1   up      1
> > 6       1.81                    osd.6   up      1
> > 11      1.81                    osd.11  up      1
> > 16      1.81                    osd.16  up      1
> > -4      7.24            host node03
> > 2       1.81                    osd.2   down    1
> > 7       1.81                    osd.7   up      1
> > 12      1.81                    osd.12  down    1
> > 17      1.81                    osd.17  down    1
> > -5      7.24            host node04
> > 3       1.81                    osd.3   up      1
> > 8       1.81                    osd.8   up      1
> > 13      1.81                    osd.13  up      1
> > 18      1.81                    osd.18  up      1
> > -6      7.24            host node05
> > 4       1.81                    osd.4   up      1
> > 9       1.81                    osd.9   up      1
> > 14      1.81                    osd.14  up      1
> > 19      1.81                    osd.19  up      1
> >
> > So it seems the services are either not shut down correctly when the
> > reboot begins, or they do not get enough time to actually let the
> > cluster know they are going away.
> >
> > If I stop the OSDs on that node manually before the reboot, everything
> > works as expected and clients don't notice any interruptions.
> >
> > [C|root@node03] ~ ➜ service ceph-osd stop id=2
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=7
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=12
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ service ceph-osd stop id=17
> > ceph-osd stop/waiting
> > [C|root@node03] ~ ➜ reboot
> >
> > The upstart file was not changed from the packaged version.
> > Interestingly, the same Ceph version on a different cluster does _not_
> > show this behaviour.
> >
> > Any ideas as to what is causing this or how to diagnose this?

Do you have the OSDs running on the same boxes as the monitors?

>
> I'm not sure why it would be happening, but:
> * The OSDs send out shutdown messages to the monitor indicating they're
> going away whenever they get shut down politely. There's a short timeout to
> make sure they don't hang on you.
> * The only way the OSD doesn't get marked down during reboot is if the
> monitor doesn't get this message.
> * If the monitor isn't getting the message, the OSD either isn't sending the
> message or it's getting blocked.
>
> My guess is that for some reason the OSDs are getting the shutdown signal
> after the networking goes away.
> -Greg
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
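
To check from a script whether a node's OSDs have really been marked down before it goes offline, something like the following might help. It is only a sketch: osds_up_on_host is a made-up helper name, and it assumes the plain-text ceph osd tree column layout shown in the original post (id, weight, type/name, up/down, reweight), which may differ on other releases.

```shell
# Hypothetical helper: read `ceph osd tree` text on stdin and print the
# names of the OSDs still marked "up" under the given host.
# Assumes the column layout from the post above:
#   host line: <id> <weight> host <hostname>
#   osd line:  <id> <weight> osd.<n> <up|down> <reweight>
osds_up_on_host() {
    awk -v host="$1" '
        # A new root bucket ends whatever host section we were in.
        $3 == "root" { in_host = 0; next }
        # A host line starts a section; remember whether it is ours.
        $3 == "host" { in_host = ($4 == host); next }
        # Inside our host, report every OSD whose state column is "up".
        in_host && $3 ~ /^osd\./ && $4 == "up" { print $3 }
    '
}
```

On a live cluster this would be used as `ceph osd tree | osds_up_on_host node03`; empty output means every OSD on that host has been marked down. Run against the listing in the original post, it should print only osd.7.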