Re: mds server(s) crashed

Bob Ababurko <bob@xxxxxxxxxxxx> · Tue, 11 Aug 2015 10:23:19 -0700

Here is the backtrace from the core dump. 
(gdb) bt
#0  0x00007f71f5404ffb in raise () from /lib64/libpthread.so.0
#1  0x000000000087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x00007f71f40235d7 in raise () from /lib64/libc.so.6
#5  0x00007f71f4024cc8 in abort () from /lib64/libc.so.6
#6  0x00007f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#7  0x00007f71f4925926 in ?? () from /lib64/libstdc++.so.6
#8  0x00007f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
#9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x000000000077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
Python Exception <type 'exceptions.IndexError'> list index out of range:
#11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
#12 0x00000000007d7d44 in complete (r=0, this=0x502b000) at include/Context.h:65
#13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
#14 0x0000000000894818 in Finisher::finisher_thread_entry (this=0x5108698) at common/Finisher.cc:59
#15 0x00007f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f71f40e41ad in clone () from /lib64/libc.so.6
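
For anyone who wants to pull out just the frames that carry source locations (here, the ceph-mds frames such as #10 and #11), the following is a minimal sketch. It is not part of Ceph or gdb: the names `FRAME_RE`, `parse_frames` and `source_frames` are made up for illustration, and the regex only assumes the `bt` line layout shown above. Lines that do not look like a normal frame (for example `<signal handler called>` or the pretty-printer message) are skipped.

```python
import re

# Matches gdb "bt" lines of the two shapes seen in the trace above:
#   #9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
#   #11 CDir::_omap_fetched (this=...) at mds/CDir.cc:1700
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"          # frame number
    r"(?:0x[0-9a-f]+ in )?"       # optional program counter
    r"(?P<func>.+?) \(.*\)"       # function name and argument list
    r"(?: (?:at (?P<src>\S+:\d+)|from (?P<lib>\S+)))?\s*$"
)


def parse_frames(bt_text):
    """Return one dict per recognizable frame; other lines are ignored."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append({
                "num": int(m.group("num")),
                "func": m.group("func"),
                "src": m.group("src"),   # "file:line" or None
                "lib": m.group("lib"),   # shared library path or None
            })
    return frames


def source_frames(bt_text):
    """Keep only frames that gdb could map to a source file and line."""
    return [f for f in parse_frames(bt_text) if f["src"]]


# Three lines copied from the backtrace above, used as sample input.
SAMPLE = '''\
#9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
#10 0x000000000077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
#11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
'''

if __name__ == "__main__":
    for frame in source_frames(SAMPLE):
        print(frame["num"], frame["func"], frame["src"])
```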

I have also captured a log file with debug mds = 20.  It was 1.2GB, so I bzip2'd it with maximum compression and got it down to 75MB.  I wasn't sure where to upload it, so if there is a better place to put it, please let me know.

https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8

thanks,
Bob

On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:

> I had a dual mds server configuration and have been copying data via the
> cephfs kernel module to my cluster for the past 3 weeks, and just had an MDS
> crash that halted all IO.  Leading up to the crash, I ran a test dd that
> increased the throughput by about 2x and then stopped it, but about 10
> minutes later the MDS server crashed and did not fail over to the standby
> properly.  I am using an active/standby mds configuration, but neither of
> the mds servers will stay running at this point; both crash shortly after
> being started.
>
> [bababurko@cephmon01 ~]$ sudo ceph -s
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             mds cluster is degraded
>             mds cephmds02 is laggy
>             noscrub,nodeep-scrub flag(s) set
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>      osdmap e324: 30 osds: 30 up, 30 in
>             flags noscrub,nodeep-scrub
>       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>             14051 GB used, 13880 GB / 27931 GB avail
>                 2112 active+clean
>
> I am not sure what information is relevant, so I will try to cover what I
> think is relevant based on posts I have read through:
>
> Cluster:
> running ceph-0.94.1 on CentOS 7.1
> [root@mdstest02 bababurko]$ uname -r
> 3.10.0-229.el7.x86_64
>
> Here is my ceph-mds log with 'debug objecter = 10':
>
> https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=

Could you use gdb to check where the crash happened? (gdb
/usr/local/bin/ceph-mds /core.xxxxx.  You may need to recompile the mds
with debuginfo.)

Yan, Zheng
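
The gdb invocation suggested above can also be run non-interactively, which makes it easier to save the trace to a file or paste it into a mail. A minimal sketch, assuming gdb is on the PATH: `--batch` and `-ex` are standard gdb options, while the helper names `gdb_bt_command` and `capture_backtrace` are hypothetical and the binary and core paths are whatever applies on your system (the core file name is left as a placeholder, as in the message above).

```python
import subprocess


def gdb_bt_command(binary, core, all_threads=False):
    """Build the argument list for a one-shot gdb backtrace.

    --batch makes gdb exit after running the commands given with -ex.
    """
    bt = "thread apply all bt" if all_threads else "bt"
    return ["gdb", "--batch", "-ex", bt, binary, core]


def capture_backtrace(binary, core, all_threads=False):
    """Run gdb on the core file and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_bt_command(binary, core, all_threads),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero yet still print a usable trace
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: substitute the real ceph-mds binary and core file.
    print(" ".join(gdb_bt_command("/usr/local/bin/ceph-mds", "/core.xxxxx")))
```

Without debuginfo for ceph-mds, the frames inside the daemon will mostly show as addresses or `??`, which is why a rebuild with debug symbols may be needed.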

> cat /sys/kernel/debug/ceph/*/mdsc output:
>
> https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
>
> ceph.conf :
>
> https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
>
> I have copied almost 5TB of small files to this cluster, which has taken the
> better part of three weeks, so I am really hoping that there is a way to
> recover from this.  This is our POC cluster.
>
> I'm sure I have missed something relevant as I'm just getting my mind back
> after nearly losing it, so feel free to ask for anything to assist.
>
> Any help would be greatly appreciated.
>
> thanks,
> Bob
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
