Re: mds server(s) crashed

Bob Ababurko <bob@xxxxxxxxxxxx> · Tue, 11 Aug 2015 10:43:34 -0700

Yes, this was a package install, and I used ceph-debuginfo, so hopefully the backtrace output is useful.
I thought it was interesting that you mentioned reproducing this with an ls. Aside from the large dd I ran before this issue surfaced, your post made me recall that, around the same time, I also ran ls a few times to drill down and eventually list the files located two subdirectories down.  I also recall briefly finding it strange that the results came back so quickly, because our NetApp takes forever to do this. It was so quick that, in retrospect, the list of files may not have been complete.  I regret not following up on that thought.

On Tue, Aug 11, 2015 at 1:52 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:

> I had a dual mds server configuration and have been copying data via cephfs
> kernel module to my cluster for the past 3 weeks and just had an MDS crash
> halting all IO.  Leading up to the crash, I ran a test dd that increased the
> throughput by about 2x and stopped it but about 10 minutes later, the MDS
> server crashed and did not fail over to the standby properly. I have been using
> an active/standby mds configuration but neither of the mds servers will stay
> running at this point and crash after starting them.
>
> [bababurko@cephmon01 ~]$ sudo ceph -s
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             mds cluster is degraded
>             mds cephmds02 is laggy
>             noscrub,nodeep-scrub flag(s) set
>      monmap e1: 3 mons at
> {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>      osdmap e324: 30 osds: 30 up, 30 in
>             flags noscrub,nodeep-scrub
>       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>             14051 GB used, 13880 GB / 27931 GB avail
>                 2112 active+clean
>
>
> I am not sure what information is relevant so I will try to cover what I
> think is relevant based on posts I have read through:
>
> Cluster:
> running ceph-0.94.1 on CentOS 7.1
> [root@mdstest02 bababurko]$ uname -r
> 3.10.0-229.el7.x86_64
>
> Here is my ceph-mds log with 'debug objecter = 10' :
>
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

Ouch!  Unfortunately all we can tell from this is that we're hitting
an assertion somewhere while loading a directory fragment from disk.
As Zheng says, you'll need to drill a bit deeper.  If you were
installing from packages you may find ceph-debuginfo useful.  In
addition to getting us a clearer stack trace with debug symbols,
please also crank "debug mds" up to 20 (this is massively verbose so
hopefully it doesn't take too long to reproduce the issue).
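
For reference, the log levels mentioned in this thread could be set in
ceph.conf on the MDS hosts roughly as follows. This is only a sketch: the
file path and the choice of the [mds] section are assumptions about a
typical package install, not something stated in the thread. Because the
daemons crash shortly after starting, setting the levels in the config
file before restarting is probably simpler than changing them at runtime.

```ini
# /etc/ceph/ceph.conf on each MDS host (path assumed; adjust as needed)
[mds]
    # Very verbose; revert once the crash has been captured.
    debug mds = 20
    # Level already used for the earlier log in this thread.
    debug objecter = 10
```

The MDS would need to be restarted after the change so the new levels
apply from startup.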

Hopefully this is fairly straightforward to reproduce.  If it's
something fundamentally malformed on disk then just doing a recursive
ls on the filesystem would trigger it, at least.
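
The recursive listing suggested above could be scripted so that the
point of failure is easier to spot. The following is a minimal sketch,
not a Ceph tool: the function name walk_and_stat and its verbose
parameter are hypothetical, and it only uses the Python standard
library. It walks a tree, stats every entry (roughly what "ls -lR"
does), and collects errors instead of stopping at the first one.

```python
import os
import sys


def walk_and_stat(root, verbose=False):
    """Recursively list `root` and lstat every entry, like `ls -lR`.

    Returns (seen, errors): the number of entries successfully
    stat-ed, and a list of (path, exception) pairs for directories
    or entries that could not be read.
    """
    seen = 0
    errors = []

    def on_error(exc):
        # os.walk calls this when listing a directory fails.
        errors.append((exc.filename, exc))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        if verbose:
            # Printed after the directory was listed successfully, so
            # if the walk hangs, the trouble is in a directory visited
            # after the last one shown here.
            print(dirpath, file=sys.stderr)
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
                seen += 1
            except OSError as exc:
                errors.append((path, exc))
    return seen, errors
```

Calling walk_and_stat("/mnt/cephfs", verbose=True) would walk a CephFS
mount (the mount point here is an assumption). Note that if the MDS
crashes mid-walk, the client may simply block rather than return an
error, in which case the last directory printed is the useful clue.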

Cheers,
John

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com