I have a dual MDS server configuration and have been copying data to my cluster via the CephFS kernel module for the past 3 weeks, and I just had an MDS crash that halted all IO. Leading up to the crash, I ran a test dd that roughly doubled throughput, then stopped it; about 10 minutes later the MDS crashed and did not fail over to the standby properly. I am using an active/standby MDS configuration, but at this point neither MDS will stay running; both crash shortly after being started.
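For reference, the dd test was along these lines (the mount point and sizes below are illustrative, not the exact values I used):

```shell
# Illustrative CephFS write-throughput test: stream 1 GiB of zeros onto the
# mount and let dd report the effective write rate. /mnt/cephfs is a
# placeholder for the actual kernel-client mount point.
dd if=/dev/zero of=/mnt/cephfs/ddtest.img bs=4M count=256 conv=fdatasync
```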
[bababurko@cephmon01 ~]$ sudo ceph -s
    cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
     health HEALTH_WARN
            mds cluster is degraded
            mds cephmds02 is laggy
            noscrub,nodeep-scrub flag(s) set
     monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
            election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
     mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
     osdmap e324: 30 osds: 30 up, 30 in
            flags noscrub,nodeep-scrub
      pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
            14051 GB used, 13880 GB / 27931 GB avail
                2112 active+clean
I am not sure what information is relevant, so I will include what seems useful based on posts I have read through:
Cluster:
running ceph-0.94.1 on CentOS 7.1
[root@mdstest02 bababurko]$ uname -r
3.10.0-229.el7.x86_64
Here is my ceph-mds log with 'debug objecter = 10' set:
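In case it matters how the debug level was raised: it was set in ceph.conf along these lines (a minimal sketch of the convention, not my actual config file):

```
[mds]
    debug objecter = 10
```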
cat /sys/kernel/debug/ceph/*/mdsc output:
ceph.conf :
I have copied almost 5 TB of small files to this cluster, which has taken the better part of three weeks, so I am really hoping there is a way to recover from this. This is our POC cluster.
I'm sure I have missed something relevant, as I'm just getting my mind back after nearly losing it, so feel free to ask for anything that would help.
Any help would be greatly appreciated.
thanks,
Bob
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com