Hi Nick,

On Tue, 25 Sep 2012, Nick Bartos wrote:
> I need to figure out some way of determining when it's OK to safely
> reboot a single node.  I believe this involves making sure that at
> least one other monitor is running and up to date, and that all the
> PGs on the local OSDs have up-to-date copies somewhere else in the
> cluster.  We're not concerned about the MDS at this time, since we're
> not currently using the POSIX filesystem.
>
> I recall having a verbal conversation with Sage on this topic, but
> apparently I didn't take good notes, or I can't find them.  I do
> remember the solution was somewhat complicated.  Is there any sort of
> straightforward 'ceph' command that can do this now?  If there isn't
> one, I think it would be really great if something like that could be
> implemented.  It seems like a common enough use case to warrant a
> simple command that could tell the admin whether rebooting the node
> would render the cluster partially unusable.

Making a conservative determination should be pretty straightforward.
Something like:

 - make sure losing any local mon won't break quorum
 - make sure all PGs touching the local osd(s) are active+clean and
   have other osds in the acting set

should do the trick, as a first pass at least.

This can all be done by analyzing the output of 'ceph pg dump
--format=json', 'ceph osd dump --format=json', and 'ceph
quorum_status'.  The annoying part is just mapping IP addresses to
osds and mons to figure out which ones are local...

sage
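
A minimal sketch of those checks might look like the script below.  It
assumes Python with only the standard library, the 'ceph' CLI in the
PATH with admin credentials, and JSON field names ('quorum', 'mons',
'osds', 'public_addr', 'pg_stats', 'acting', 'state') as emitted by
Ceph releases of that era; verify the field names against your own
version's output before relying on this.

#!/usr/bin/env python
"""Sketch: decide whether it is safe to reboot this node.

Checks, per Sage's outline:
 1. losing any local mon must not break quorum
 2. every PG touching a local OSD must be active+clean and have
    at least one other OSD in its acting set
"""
import json
import socket
import subprocess

def ceph(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(('ceph',) + args)
    return json.loads(out)

def local_addrs():
    """Collect this host's IP addresses, to match against the
    addresses in the mon and osd maps."""
    return set(info[4][0]
               for info in socket.getaddrinfo(socket.gethostname(), None))

def safe_to_reboot():
    addrs = local_addrs()

    # 1. Make sure losing any local mon won't break quorum.  This is
    # conservative: it assumes every local mon is currently in quorum.
    qs = ceph('quorum_status', '--format=json')
    mons = qs['monmap']['mons']
    local_mons = [m for m in mons
                  if m['addr'].split(':')[0] in addrs]
    surviving = len(qs['quorum']) - len(local_mons)
    if surviving < len(mons) // 2 + 1:
        return False, 'rebooting would break mon quorum'

    # 2. Find the OSD ids whose public address is on this host.
    osdmap = ceph('osd', 'dump', '--format=json')
    local_osds = set(o['osd'] for o in osdmap['osds']
                     if o['public_addr'].split(':')[0] in addrs)

    # 3. Every PG touching a local OSD must be active+clean and keep
    # a copy on some other host's OSD while this node is down.
    pgmap = ceph('pg', 'dump', '--format=json')
    for pg in pgmap['pg_stats']:
        acting = set(pg['acting'])
        if not acting & local_osds:
            continue
        if pg['state'] != 'active+clean':
            return False, 'pg %s is %s' % (pg['pgid'], pg['state'])
        if not acting - local_osds:
            return False, 'pg %s has no replica off this host' % pg['pgid']

    return True, 'ok'

if __name__ == '__main__':
    ok, why = safe_to_reboot()
    print('%s: %s' % ('SAFE' if ok else 'UNSAFE', why))

Run it on the node you intend to reboot; it prints SAFE or UNSAFE plus
a reason.  The quorum check errs on the side of caution, so it may
report UNSAFE in corner cases where a more careful analysis (e.g. a
local mon that is already out of quorum) would allow the reboot.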