On Wed, Aug 12, 2015 at 1:23 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> Here is the backtrace from the core dump.
>
> (gdb) bt
> #0  0x00007f71f5404ffb in raise () from /lib64/libpthread.so.0
> #1  0x000000000087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59
> #2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
> #3  <signal handler called>
> #4  0x00007f71f40235d7 in raise () from /lib64/libc.so.6
> #5  0x00007f71f4024cc8 in abort () from /lib64/libc.so.6
> #6  0x00007f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
> #7  0x00007f71f4925926 in ?? () from /lib64/libstdc++.so.6
> #8  0x00007f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
> #9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
> #10 0x000000000077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
> Python Exception <type 'exceptions.IndexError'> list index out of range:
> #11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
> #12 0x00000000007d7d44 in complete (r=0, this=0x502b000) at include/Context.h:65
> #13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
> #14 0x0000000000894818 in Finisher::finisher_thread_entry (this=0x5108698) at common/Finisher.cc:59
> #15 0x00007f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
> #16 0x00007f71f40e41ad in clone () from /lib64/libc.so.6
>
> I have also gotten a log file with 'debug mds = 20'. It was 1.2GB, so I
> bzip2'd it with max compression and got it down to 75MB. I wasn't sure
> where to upload it, so if there is a better place to put it, please let
> me know.
>
> https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8
>

Please try setting mds_reconnect_timeout to 0. It should allow your MDS to
recover, but this setting will make client mounts unusable after the MDS
recovers. Also, please use a recent client kernel such as 4.0, or use
ceph-fuse.

> thanks,
> Bob
>
>
> On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>
>> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
>> > I had a dual-MDS server configuration and have been copying data to my
>> > cluster via the cephfs kernel module for the past 3 weeks, and I just
>> > had an MDS crash that halted all IO. Leading up to the crash, I ran a
>> > test dd that increased the throughput by about 2x and then stopped it,
>> > but about 10 minutes later the MDS server crashed and did not fail over
>> > to the standby properly. I have been using an active/standby MDS
>> > configuration, but neither of the MDS servers will stay running at this
>> > point; they crash shortly after being started.
>> >
>> > [bababurko@cephmon01 ~]$ sudo ceph -s
>> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>> >      health HEALTH_WARN
>> >             mds cluster is degraded
>> >             mds cephmds02 is laggy
>> >             noscrub,nodeep-scrub flag(s) set
>> >      monmap e1: 3 mons at
>> >             {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>> >      osdmap e324: 30 osds: 30 up, 30 in
>> >             flags noscrub,nodeep-scrub
>> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>> >             14051 GB used, 13880 GB / 27931 GB avail
>> >                 2112 active+clean
>> >
>> >
>> > I am not sure what information is relevant, so I will try to cover what
>> > I think is relevant based on posts I have read through:
>> >
>> > Cluster:
>> > running ceph-0.94.1 on CentOS 7.1
>> > [root@mdstest02 bababurko]$ uname -r
>> > 3.10.0-229.el7.x86_64
>> >
>> > Here is my ceph-mds log with 'debug objecter = 10':
>> >
>> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>>
>> Could you use gdb to check where the crash happened? (gdb
>> /usr/local/bin/ceph-mds /core.xxxxx; you may need to re-compile the MDS
>> with debuginfo.)
>>
>> Yan, Zheng
>>
>> >
>> > cat /sys/kernel/debug/ceph/*/mdsc output:
>> >
>> > https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
>> >
>> > ceph.conf:
>> >
>> > https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
>> >
>> > I have copied almost 5TB of small files to this cluster, which has
>> > taken the better part of three weeks, so I am really hoping that there
>> > is a way to recover from this. This is our POC cluster.
>> >
>> > I'm sure I have missed something relevant, as I'm just getting my mind
>> > back after nearly losing it, so feel free to ask for anything to assist.
>> >
>> > Any help would be greatly appreciated.
>> >
>> > thanks,
>> > Bob
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
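
[For readers of the archive: a minimal sketch of how the mds_reconnect_timeout
suggestion above could be applied. The daemon name cephmds02 is taken from the
'ceph -s' output in this thread; the restart command is an assumption based on
the Hammer-era sysvinit packaging and should be adapted to however ceph-mds is
managed on your hosts.]

    # ceph.conf on the MDS hosts, then restart the daemon:
    [mds]
        mds reconnect timeout = 0

    # restart command assumed for Hammer on CentOS 7 (sysvinit script);
    # adjust to whatever manages ceph-mds on your systems:
    sudo service ceph restart mds.cephmds02

    # if the daemon is up and responsive, the value can also be injected
    # at runtime instead of editing ceph.conf:
    sudo ceph tell mds.cephmds02 injectargs '--mds_reconnect_timeout 0'

As noted in the reply above, client mounts will not survive this once the MDS
recovers, so plan on remounting clients afterwards (with a recent kernel such
as 4.0, or ceph-fuse).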
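
[Similarly, a rough sketch of the gdb workflow suggested earlier in the
thread. The binary path and core file name are the placeholders used in the
thread, not real paths on any particular system.]

    # allow core dumps before reproducing the crash:
    ulimit -c unlimited

    # open the core against the matching ceph-mds binary
    # (symbols only resolve if the build carries debuginfo):
    gdb /usr/local/bin/ceph-mds /core.xxxxx

    # inside gdb, print the crashing thread's stack, or all threads:
    (gdb) bt
    (gdb) thread apply all bt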