Need help: MDS cluster completely dead!

john.spray@xxxxxxxxxx (John Spray) · Wed, 3 Sep 2014 15:00:53 +0100

Hi Florent,

The first thing to do is to turn up the logging on the MDS (if you
haven't already) by setting "debug mds = 20":
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/#subsystem-log-and-debug-settings
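
As a minimal sketch, assuming the setting goes into ceph.conf on each
MDS host and the daemon is restarted afterwards so that it picks the
change up, the fragment would look roughly like this:

```ini
[mds]
    debug mds = 20
```

At this level the MDS log grows quickly, so it is worth turning the
setting back down once the problem has been captured.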

Since you say they appear as 'active' in "ceph status", I assume they
are running rather than crashing again, but it would be good to log
into the MDS servers and check that there really are running ceph-mds
processes.  If the MDS daemons are running but apparently
unresponsive, you may be able to get a little bit of extra info from
the running MDS by doing "ceph daemon mds.<name> <command>", where the
interesting commands are dump_ops_in_flight, status and objecter_ops.
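
If it helps to collect all of that in one go, here is a rough Python
sketch of the checks above. It only wraps the three admin socket
commands already mentioned, plus a pgrep check for the ceph-mds
process. The function names and the MDS name "a" in the usage note are
hypothetical, and it has to run on the host where the MDS daemon
lives, since the admin socket is local to that host.

```python
import shutil
import subprocess

# Admin socket commands named in the text above.
DIAG_COMMANDS = ["status", "dump_ops_in_flight", "objecter_ops"]


def build_diag_commands(mds_name):
    """Return one 'ceph daemon' argv list per diagnostic command."""
    target = "mds." + mds_name
    return [["ceph", "daemon", target, cmd] for cmd in DIAG_COMMANDS]


def mds_process_running():
    """Return True if pgrep finds a ceph-mds process on this host."""
    if shutil.which("pgrep") is None:
        return False  # pgrep unavailable, so we cannot confirm anything
    proc = subprocess.run(["pgrep", "-x", "ceph-mds"],
                          stdout=subprocess.DEVNULL)
    return proc.returncode == 0


def collect_diagnostics(mds_name):
    """Run each command; return {command: output or error text}."""
    results = {}
    for argv in build_diag_commands(mds_name):
        try:
            proc = subprocess.run(argv, capture_output=True,
                                  text=True, timeout=30)
            ok = proc.returncode == 0
            results[argv[-1]] = proc.stdout if ok else proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # ceph binary missing, or the daemon did not answer in time
            results[argv[-1]] = "failed: %s" % exc
    return results
```

Calling mds_process_running() and then collect_diagnostics("a") on the
MDS host gives one chunk of text per command, which can be pasted
into a reply. A command that times out is itself a useful hint that
the daemon is stuck.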

Hopefully that will give us some clues.

Cheers,
John

On Wed, Sep 3, 2014 at 11:52 AM, Florent Bautista
<bautista.florent at gmail.com> wrote:
> Hi everyone,
>
> I am using the Ceph Firefly release.
>
> Until yesterday I had an MDS cluster with only one MDS; then I tried to add a
> second one to test multi-MDS. I thought I could go back to one MDS whenever I
> wanted, but it seems that is not possible!
>
> Both crashed last night, and I am unable to get them back today.
>
> They appear as active in ceph -s, and clients using the 3.16 kernel can mount
> the filesystem, but no operation completes: "ls" freezes, the client's load
> average climbs, and the MDSes do nothing (no CPU usage, nothing in the logs
> except some "mdsload" messages and, after some time, "closing stale session
> client").
>
> How can I debug this situation and recover my data?
>
> Thank you very much.