----- Message from george.ryall at stfc.ac.uk ---------
    Date: Wed, 16 Jul 2014 14:45:35 +0000
    From: george.ryall at stfc.ac.uk
 Subject: Re: Failed monitors
      To: ceph-users at lists.ceph.com

> This now appears to have partially fixed itself. I am now able to
> run commands on the cluster, though one of the monitors is down. I
> still have no idea what was going on.

Hi George,

What do the logs /var/log/ceph/ceph-mon.*.log say?

Kenneth

>
> George
>
> From: george.ryall at stfc.ac.uk [mailto:george.ryall at stfc.ac.uk]
> Sent: 16 July 2014 13:59
> To: ceph-users at lists.ceph.com
> Subject: [ceph-users] Failed monitors
>
> On Friday I managed to run a command I probably shouldn't and knock
> half our OSDs offline. By setting the noout and nodown flags and
> bringing up the OSDs on the boxes that don't also have mons running
> on them I got most of the cluster back up by today (it took me a
> while to discover the nodown flag). However, along the way I had to
> restart the mon service a few times and in two cases the monitors
> didn't seem to be allowed to re-join the cluster and I reinstalled
> the monitor service on them manually. Then this morning I am getting
> the error message I associate with the mons being down whenever I
> try and run commands on the cluster. However, restarting the mon
> service on the three machines acting as monitors does not appear to
> help.
>
> The message I get is:
> 2014-07-16 13:33:11.389331 7f6ba845b700 0 -- 130.246.179.122:0/1015725 >> 130.246.179.181:6789/0 pipe(0x7f6b98005f20 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6b980097d0).fault
>
> What else can I try to bring the cluster back? What logs would it be
> useful for me to look at? Have I missed something?
>
>
> George Ryall
>
> Scientific Computing | STFC Rutherford Appleton Laboratory | Harwell
> Oxford | Didcot | OX11 0QX
> (01235 44) 5021
>
> --
> Scanned by iCritical.
----- End message from george.ryall at stfc.ac.uk -----

--
Kind regards,
Kenneth Waegeman
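
For anyone following the thread, here is a minimal Python sketch of the two checks discussed: pulling the last lines of the monitor logs, and reading the local and peer addresses out of a fault line like the one quoted above. The function names (parse_fault, tail_logs) and the regular expression are illustrative assumptions, not part of Ceph, and the pattern has only been matched against the single line in this thread.

```python
import glob
import re
from collections import deque

# Assumed shape of the fault line, based only on the example in this thread:
#   "... -- <local addr> >> <peer addr> pipe(...).fault"
FAULT_RE = re.compile(r"-- (?P<local>\S+) >> (?P<peer>\S+) pipe\(.*\)\.fault")


def parse_fault(line):
    """Return (local, peer) addresses from a fault line, or None if no match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    return match.group("local"), match.group("peer")


def tail_logs(pattern="/var/log/ceph/ceph-mon.*.log", count=20):
    """Map each log file matching `pattern` to its last `count` lines."""
    tails = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" avoids failing on stray non-UTF-8 bytes in a log.
        with open(path, errors="replace") as handle:
            tails[path] = list(deque(handle, maxlen=count))
    return tails


if __name__ == "__main__":
    # Print the tail of every monitor log found on this host.
    for path, lines in tail_logs().items():
        print(f"==> {path} <==")
        for line in lines:
            print(line.rstrip())
```

Run on each monitor host, this prints the end of every file matching /var/log/ceph/ceph-mon.*.log. Applied to the quoted message, parse_fault gives 130.246.179.181:6789/0 as the peer, which suggests the client was failing to reach the monitor at that address.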