Re: Power Cycle Problems

> On two separate occasions I have lost power to my Ceph cluster. Both times, I had trouble bringing the cluster back to good health. I am wondering if I need to config something that would solve this problem?

No special configuration should be necessary. I've had the
unfortunate luck of witnessing several power loss events with large
Ceph clusters, and in each case something other than Ceph was the
source of frustration once power was restored. That said, the
monitor daemons should be started first and must form a quorum
before the cluster will be usable. It sounds like you have made it
that far if you're getting output from "ceph health". The next step
is to get your Ceph OSD daemons running, which requires the data
partitions to be mounted and the journal devices present. On Ubuntu
installations this is handled by udev scripts installed by the Ceph
packages (I think this may also be true for RHEL/CentOS, but I have
not verified it). Short of the udev method, you can mount the data
partitions manually. If init still doesn't work once a data
partition is mounted, you can start the OSD by hand; to do so you
will need to know the location of your keyring, your ceph.conf, and
the OSD id. If you are unsure of the OSD id, look in the root of the
mounted OSD data partition for a file named "whoami". To manually
start:

/usr/bin/ceph-osd -i ${OSD_ID} --pid-file /var/run/ceph/osd.${OSD_ID}.pid -c /etc/ceph/ceph.conf
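Put together, a rough recovery sequence might look like the
following. This is only a sketch: the device name /dev/sdb1 and the
/var/lib/ceph/osd/ceph-0 mount point are examples based on the
packaged defaults, so adjust them for your own disks and ceph.conf:

ceph mon stat                    # monitors must have quorum first
mount /dev/sdb1 /mnt             # example device for the OSD data partition
cat /mnt/whoami                  # prints the OSD id, e.g. 0
umount /mnt
mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf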

After that, it's a matter of examining the logs if you're still
having issues getting the OSDs to boot.
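With the packaged defaults the OSD logs end up under /var/log/ceph,
so something like this is a reasonable starting point (the path is
the default; adjust it if you've changed "log file" in ceph.conf):

tail -n 100 /var/log/ceph/ceph-osd.${OSD_ID}.log
grep -i error /var/log/ceph/ceph-osd.${OSD_ID}.log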

> After powering back up the cluster, “ceph health” revealed stale pages, mds cluster degraded, 3/3 OSDs down. I tried to issue “sudo /etc/init.d/ceph -a start” but I got no output from the command and the health status did not change.

The placement groups show as stale because the OSDs that carry them
are down and have not reported their state to the monitors recently.
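Once the OSDs are running and check back in with the monitors, the
stale flags should clear on their own. You can watch that happen
with the usual status commands, for example:

ceph osd stat               # count of OSDs up/in
ceph osd tree               # which OSDs are down, grouped by host
ceph pg dump_stuck stale    # placement groups still marked stale
ceph health detail          # overall health with details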

> I ended up having to re-install the cluster to fix the issue, but as my group wants to use Ceph for VM storage in the future, we need to find a solution.

That's a shame, but at least you will be better prepared if it
happens again. Hopefully your luck is not as unfortunate as mine!

-- 

Kyle Bader
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




