On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> I had a dual MDS server configuration and have been copying data via the
> cephfs kernel module to my cluster for the past 3 weeks, and just had an
> MDS crash halting all IO. Leading up to the crash, I ran a test dd that
> increased the throughput by about 2x and stopped it, but about 10 minutes
> later the MDS server crashed and did not fail over to the standby
> properly. I have been using an active/standby MDS configuration, but
> neither of the MDS servers will stay running at this point; they crash
> shortly after starting.
>
> [bababurko@cephmon01 ~]$ sudo ceph -s
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             mds cluster is degraded
>             mds cephmds02 is laggy
>             noscrub,nodeep-scrub flag(s) set
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>      osdmap e324: 30 osds: 30 up, 30 in
>             flags noscrub,nodeep-scrub
>       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>             14051 GB used, 13880 GB / 27931 GB avail
>                 2112 active+clean
>
> I am not sure what information is relevant, so I will try to cover what I
> think is relevant based on posts I have read through:
>
> Cluster:
> running ceph-0.94.1 on CentOS 7.1
> [root@mdstest02 bababurko]$ uname -r
> 3.10.0-229.el7.x86_64
>
> Here is my ceph-mds log with 'debug objecter = 10':
>
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

Could you use gdb to check where the crash happened? (gdb /usr/local/bin/ceph-mds /core.xxxxx; you may need to re-compile the mds with debuginfo.) A couple of rough sketches of that workflow, and of raising the mds debug logging as a fallback, are appended below the quoted message.

Yan, Zheng

> cat /sys/kernel/debug/ceph/*/mdsc output:
>
> https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
>
> ceph.conf:
>
> https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
>
> I have copied almost 5TB of small files to this cluster, which has taken
> the better part of three weeks, so I am really hoping that there is a way
> to recover from this. This is our POC cluster.
>
> I'm sure I have missed something relevant, as I'm just getting my mind
> back after nearly losing it, so feel free to ask for anything to assist.
>
> Any help would be greatly appreciated.
>
> thanks,
> Bob
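
A minimal sketch of the core-dump workflow suggested above, assuming a packaged hammer install on CentOS 7. The debuginfo package name, binary path, and core file name are assumptions and may differ for a source build:

    # On the mds host, as root: allow core dumps and pick a predictable path
    # (assumes abrt is not intercepting cores).
    ulimit -c unlimited
    echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

    # Debug symbols so the backtrace has function names (package name
    # assumed for the el7 RPMs; a source build would instead be compiled
    # with -g).
    yum install -y ceph-debuginfo

    # Run the mds in the foreground from this shell so the ulimit applies,
    # and let it crash again.
    ceph-mds -i cephmds02 -f

    # Then open the core in gdb. Use /usr/bin/ceph-mds for the RPMs or
    # /usr/local/bin/ceph-mds for a source build; <pid> is a placeholder.
    gdb /usr/bin/ceph-mds /var/tmp/core.ceph-mds.<pid>
    (gdb) bt                      # backtrace of the thread that crashed
    (gdb) thread apply all bt     # backtraces of every thread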
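
If the daemon dies too quickly to attach a debugger, the assert and its backtrace can usually be captured in the mds log instead. A sketch of raising the debug levels, assuming the stock log location; the injectargs form only works while the mds is still running:

    # In ceph.conf on the mds hosts, before restarting ceph-mds:
    [mds]
        debug mds = 20
        debug journaler = 20
        debug ms = 1

    # Or, for an mds that is still up, inject the settings at runtime:
    ceph tell mds.cephmds02 injectargs '--debug-mds 20 --debug-journaler 20 --debug-ms 1'

    # The crash report ends up in the daemon log:
    less /var/log/ceph/ceph-mds.cephmds02.log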