Hi there,

Recently I ran into a problem triggered by rebooting my Ceph nodes, which eventually ended with me rebuilding the cluster from the ground up. The TL;DR question is: are there suggested best practices for taking a Ceph node offline and bringing it back online?

Following the official Ceph documentation, I set up a 4-node Ceph (Firefly) cluster in our lab last week. It consists of 1 admin node, 1 monitor node and 2 OSD nodes. All 4 nodes are physical servers running Ubuntu 14.04; no virtual machines are involved. The MDS and a 3rd OSD actually run on the monitor node. Everything looked fine at that point: 'ceph -w' reported HEALTH_OK, I could see all the available storage capacity, and I was happily writing Python code against the Ceph S3 API.

This Monday I ran apt-get upgrade on all 4 machines and then rebooted them. Once all 4 were back online, 'ceph -w' reported errors about the monitor node's IP address, along with messages like:

  {timestamp} 7fb6456f2500 0 -- {monitor IP address}:0/2924 ...... pipe(0x5516270 sd=97 :0 s=1 pgs=0 cs=0 l=1 c=0x5c54e20).fault

(Sorry, I didn't save the exact error logs since I've already rebuilt the cluster, my mistake :-( .)

Due to lab policy, only DHCP is allowed, so I updated the monitor's IP address in /etc/ceph/ceph.conf and tried to push the config to all nodes, but that didn't work. Then I tried restarting the ceph service on those nodes; no luck. I even went down the ceph-deploy purgedata route; no luck again. In the end I had to purge everything and start over from zero. Again, I'm sorry no error messages were saved; I was just too frustrated.

Now I have a working cluster again, but I don't think I can afford to redo it another time. So, the question mentioned above: how should I properly do maintenance work (upgrades, reboots) without breaking my Ceph cluster? Is there a procedure or a set of commands I should run after rebooting?

Thanks

Br.
J Hewitt
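
P.S. In case it helps with the diagnosis, here is roughly what I ran after editing ceph.conf. I didn't keep a shell history, so these are reconstructed from memory; the node names (mon-node, osd1, osd2) are placeholders for my real hostnames and the exact invocations may have differed slightly:

  # push the edited ceph.conf from the admin node to the other nodes
  ceph-deploy --overwrite-conf config push mon-node osd1 osd2

  # restart the ceph daemons on each node (Ubuntu 14.04, upstart)
  sudo restart ceph-all

  # when that didn't help, wipe the data and eventually purge everything
  ceph-deploy purgedata mon-node osd1 osd2
  ceph-deploy purge mon-node osd1 osd2
  ceph-deploy forgetkeys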