I have a dual MDS server configuration and have been copying data to my cluster via the CephFS kernel module for the past 3 weeks, and I just had an MDS crash that halted all IO. Leading up to the crash, I ran a test dd that roughly doubled throughput, then stopped it; about 10 minutes later the MDS crashed and did not fail over to the standby properly. I am using an active/standby MDS configuration, but at this point neither MDS will stay running; both crash shortly after being started.
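For reference, the dd test was along these lines (the mount point and sizes below are illustrative, not the exact values I used):

```shell
# Illustrative CephFS write-throughput test: stream 1 GiB of zeros onto the
# mount and let dd report the effective write rate. /mnt/cephfs is a
# placeholder for the actual kernel-client mount point.
dd if=/dev/zero of=/mnt/cephfs/ddtest.img bs=4M count=256 conv=fdatasync
```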
[bababurko@cephmon01 ~]$ sudo ceph -s
    cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
     health HEALTH_WARN
            mds cluster is degraded
            mds cephmds02 is laggy
            noscrub,nodeep-scrub flag(s) set
     monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
            election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
     mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
     osdmap e324: 30 osds: 30 up, 30 in
            flags noscrub,nodeep-scrub
      pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
            14051 GB used, 13880 GB / 27931 GB avail
                2112 active+clean
I am not sure what information is relevant, so I will include what seems useful based on posts I have read through:
Cluster:
running ceph-0.94.1 on CentOS 7.1
[root@mdstest02 bababurko]$ uname -r
3.10.0-229.el7.x86_64
Here is my ceph-mds log with 'debug objecter = 10' set:
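In case it matters how the debug level was raised: it was set in ceph.conf along these lines (a minimal sketch of the convention, not my actual config file):

```
[mds]
    debug objecter = 10
```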
cat /sys/kernel/debug/ceph/*/mdsc output:
ceph.conf :
I have copied almost 5 TB of small files to this cluster, which has taken the better part of three weeks, so I am really hoping there is a way to recover from this. This is our POC cluster.
I'm sure I have missed something relevant, as I'm just getting my mind back after nearly losing it, so feel free to ask for anything that would help.
Any help would be greatly appreciated.
thanks,
Bob
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com