Need help: MDS cluster completely dead!

ukernel@xxxxxxxxx (Yan, Zheng) · Thu, 4 Sep 2014 15:13:53 +0800

Which version of the MDS are you using?
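
For reference, one way to answer this is to ask the installed binary directly on each MDS host. This is only a minimal sketch: it assumes the `ceph-mds` binary is on the PATH of the host where it runs, and it reports the installed package version, which may differ from a daemon that was started before an upgrade.

```shell
#!/bin/sh
# Report the version of the ceph-mds binary installed on this host.
# Falls back to a plain message when the binary is not present.
if command -v ceph-mds >/dev/null 2>&1; then
    mds_version=$(ceph-mds --version)
else
    mds_version="ceph-mds not found on this host"
fi
echo "$mds_version"
```

Run it on every host that carries an MDS, since the two daemons may not be on the same build.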

On Wed, Sep 3, 2014 at 10:48 PM, Florent Bautista <florent at coppint.com> wrote:
> Hi John and thank you for your answer.
>
> I "solved" the problem by running: ceph mds stop 1
>
> So one MDS is marked as "stopping". A few hours later, it is still
> "stopping" (active process, consuming CPU sometimes).
>
> So the other seems to respond fine to clients...
>
> Multi-MDS is really really really unstable :-D
>
> On 09/03/2014 04:00 PM, John Spray wrote:
>> Hi Florent,
>>
>> The first thing to do is to turn up the logging on the MDS (if you
>> haven't already) -- set "debug mds = 20"
>> http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/#subsystem-log-and-debug-settings
>>
>> Since you say they appear as 'active' in "ceph status", I assume they
>> are running rather than crashing again, but it would be good to log
>> into the MDS servers and check that there really are running ceph-mds
>> processes.  If the MDS daemons are running but apparently
>> unresponsive, you may be able to get a little bit of extra info from
>> the running MDS by doing "ceph daemon mds.<name> <command>", where
>> interesting commands are dump_ops_in_flight, status, objecter_ops
>>
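
The suggestions above can be sketched as a small script run on the MDS host itself. This is a hedged example, not a definitive procedure: `MDS_NAME` is a placeholder for the local daemon id (the default `a` is an assumption), and raising the log level through `config set debug_mds 20` on the admin socket is one possible way to apply the "debug mds = 20" setting without a restart. When the `ceph` CLI is not available, the script only prints what it would run.

```shell
#!/bin/sh
# MDS_NAME is a placeholder (assumption): set it to the id of the local MDS daemon.
MDS_NAME="${MDS_NAME:-a}"
count=0

# Raise MDS logging, then collect the three dumps mentioned in the thread.
for cmd in "config set debug_mds 20" dump_ops_in_flight status objecter_ops; do
    full="ceph daemon mds.$MDS_NAME $cmd"
    if command -v ceph >/dev/null 2>&1; then
        # Unquoted on purpose so the command string splits into words.
        $full || echo "failed: $full" >&2
    else
        echo "would run: $full"
    fi
    count=$((count + 1))
done
```

The `ceph daemon` form talks to the local admin socket, so it has to run on the machine where that MDS process lives.
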
>> Hopefully that will give us some clues.
>>
>> Cheers,
>> John
>>
>> On Wed, Sep 3, 2014 at 11:52 AM, Florent Bautista
>> <bautista.florent at gmail.com> wrote:
>>> Hi everyone,
>>>
>>> I use Ceph Firefly release.
>>>
>>> I had an MDS cluster with only one MDS until yesterday, when I tried to add a
>>> second one to test multi-MDS. I thought I could go back to one MDS whenever I
>>> wanted, but it seems we can't!
>>>
>>> Both crashed this night, and I am unable to get them back today.
>>>
>>> They appear as active in ceph -s, and clients using the 3.16 kernel can mount it, but no
>>> operation completes: "ls" freezes, the client's load average climbs,
>>> and the MDSes do nothing (no CPU usage, nothing in the logs except some
>>> "mdsload" messages and, after some time, "closing stale session client").
>>>
>>> How can I debug this situation and recover my data?
>>>
>>> Thank you a lot.
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>