On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> I had a dual MDS server configuration and have been copying data via the
> CephFS kernel module to my cluster for the past 3 weeks, and I just had an
> MDS crash that halted all IO. Leading up to the crash, I ran a test dd
> that roughly doubled the throughput and then stopped it, but about 10
> minutes later the MDS server crashed and did not fail over to the standby
> properly. I have been using an active/standby MDS configuration, but at
> this point neither MDS will stay running; both crash shortly after being
> started.
>
> [bababurko@cephmon01 ~]$ sudo ceph -s
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             mds cluster is degraded
>             mds cephmds02 is laggy
>             noscrub,nodeep-scrub flag(s) set
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>      osdmap e324: 30 osds: 30 up, 30 in
>             flags noscrub,nodeep-scrub
>       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>             14051 GB used, 13880 GB / 27931 GB avail
>                 2112 active+clean
>
> I am not sure what information is relevant, so I will try to cover what I
> think matters based on posts I have read through:
>
> Cluster:
> running ceph-0.94.1 on CentOS 7.1
> [root@mdstest02 bababurko]$ uname -r
> 3.10.0-229.el7.x86_64
>
> Here is my ceph-mds log with 'debug objecter = 10':
>
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

Ouch! Unfortunately, all we can tell from this is that we're hitting an
assertion somewhere while loading a directory fragment from disk. As Zheng
says, you'll need to drill a bit deeper.

If you installed from packages, you may find ceph-debuginfo useful. In
addition to getting us a clearer stack trace with debug symbols, please
also crank "debug mds" up to 20 (this is massively verbose, so hopefully
it doesn't take too long to reproduce the issue).

Hopefully this is fairly straightforward to reproduce. If something on
disk is fundamentally malformed, then at the very least a recursive ls
over the filesystem should trigger it.

Cheers,
John
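
P.S. In case it saves you a lookup, here's roughly what that looks like on
CentOS 7. Since the daemon is dying during startup/rejoin, put the debug
setting in ceph.conf rather than injecting it at runtime, so logging is
already at 20 when the MDS comes up. Treat this as a sketch: installing the
debuginfo package assumes the matching debuginfo repo is enabled, "debug
ms = 1" is an optional extra I would usually pair with it, and the restart
command varies with how your daemons were deployed.

    # On the MDS host (assumes the ceph debuginfo repo is enabled):
    sudo yum install ceph-debuginfo

    # In /etc/ceph/ceph.conf on the MDS host:
    [mds]
        debug mds = 20
        debug ms = 1

    # Restart the MDS (adjust to your init system/deployment), then watch
    # /var/log/ceph/ceph-mds.<name>.log for the crash at the new verbosity.
    sudo service ceph restart mds.cephmds02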
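
For the reproduction side, a recursive scan from a CephFS client is enough
to make the MDS read back every directory fragment. Assuming a kernel mount
at /mnt/cephfs (substitute your actual mount point):

    # Walk the entire tree; we only care whether the MDS survives,
    # so discard the listing itself.
    ls -lR /mnt/cephfs > /dev/null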