Failed monitors

george.ryall@xxxxxxxxxx (george.ryall at stfc.ac.uk) · Wed, 16 Jul 2014 12:58:59 +0000

On Friday I managed to run a command I probably shouldn't have and knocked half our OSDs offline. By setting the noout and nodown flags and bringing up the OSDs on the boxes that don't also have mons running on them, I got most of the cluster back up by today (it took me a while to discover the nodown flag). However, along the way I had to restart the mon service a few times, and in two cases the monitors didn't seem to be allowed to rejoin the cluster, so I reinstalled the monitor service on them manually. Then this morning I started getting the error message I associate with the mons being down whenever I try to run commands on the cluster. Restarting the mon service on the three machines acting as monitors does not appear to help.

The message I get is:
2014-07-16 13:33:11.389331 7f6ba845b700  0 -- 130.246.179.122:0/1015725 >> 130.246.179.181:6789/0 pipe(0x7f6b98005f20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6b980097d0).fault
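
In case it helps anyone reading that line, here is a minimal Python sketch that pulls the two addresses out of it, so it is clear which endpoint the client failed to reach. This is not part of Ceph: the regular expression and the names SAMPLE, FAULT_RE and parse_fault are hypothetical, and the field layout is only inferred from this single example line.

```python
import re

# The fault line quoted above, copied verbatim.
SAMPLE = (
    "2014-07-16 13:33:11.389331 7f6ba845b700  0 -- "
    "130.246.179.122:0/1015725 >> 130.246.179.181:6789/0 "
    "pipe(0x7f6b98005f20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6b980097d0).fault"
)

# Assumed layout: "-- <local ip>:<port>/<nonce> >> <peer ip>:<port>/<nonce> pipe(...).fault"
# This is inferred from the one line above, not taken from the Ceph source.
FAULT_RE = re.compile(
    r"-- (?P<local>\d+\.\d+\.\d+\.\d+):(?P<local_port>\d+)/\d+"
    r" >> (?P<peer>\d+\.\d+\.\d+\.\d+):(?P<peer_port>\d+)/\d+"
    r" pipe\(.*\)\.fault"
)


def parse_fault(line):
    """Return the local and peer endpoints of a fault line, or None if it does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    return {
        "local": match.group("local"),
        "peer": match.group("peer"),
        "peer_port": int(match.group("peer_port")),
    }


if __name__ == "__main__":
    print(parse_fault(SAMPLE))
```

For the line above, the peer comes out as 130.246.179.181 on port 6789, which is the default monitor port, so the client at 130.246.179.122 is failing to talk to that monitor.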

What else can I try to bring the cluster back? What logs would it be useful for me to look at? Have I missed something?

George Ryall

Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell Oxford | Didcot | OX11 0QX
(01235 44) 5021
