Re: mds server(s) crashed

On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> Here is the backtrace from the core dump.
>
> (gdb) bt
> #0  0x00007f71f5404ffb in raise () from /lib64/libpthread.so.0
> #1  0x000000000087065d in reraise_fatal (signum=6) at
> global/signal_handler.cc:59
> #2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
> #3  <signal handler called>
> #4  0x00007f71f40235d7 in raise () from /lib64/libc.so.6
> #5  0x00007f71f4024cc8 in abort () from /lib64/libc.so.6
> #6  0x00007f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from
> /lib64/libstdc++.so.6
> #7  0x00007f71f4925926 in ?? () from /lib64/libstdc++.so.6
> #8  0x00007f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
> #9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
> #10 0x000000000077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
> Python Exception <type 'exceptions.IndexError'> list index out of range:
> #11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with
> 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
> #12 0x00000000007d7d44 in complete (r=0, this=0x502b000) at
> include/Context.h:65
> #13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
> #14 0x0000000000894818 in Finisher::finisher_thread_entry (this=0x5108698)
> at common/Finisher.cc:59
> #15 0x00007f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
> #16 0x00007f71f40e41ad in clone () from /lib64/libc.so.6

If we believe the line numbers here, then it's a malloc failure.  Are
you running out of memory?

The MDS is loading a bunch of these 64k file directories (presumably a
characteristic of your workload), and ending up with an unusually
large number of inodes in cache (this is all happening during the
"rejoin" phase so no trimming of the cache is done and we merrily
exceed the default mds_cache_size limit of 100k inodes).
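
If you want to see how close the cache is to the limit, and give it more
headroom while you recover, something like the following should do it (the
mds name and the new value here are just examples for your setup, and a
larger cache of course needs RAM to back it):

  # on the MDS host: the "inodes" counters show how much is in cache
  sudo ceph daemon mds.cephmds02 perf dump | grep -i inode
  # raise the in-memory limit on the running daemon
  sudo ceph tell mds.cephmds02 injectargs '--mds_cache_size 500000'
  # to make it stick across restarts, add "mds cache size = 500000"
  # under [mds] in ceph.conf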

The thing triggering the load of the dirs is clients replaying
requests that refer to inodes by inode number, and the MDS's procedure
for handling that involves fully loading the relevant dirs.  That
might be something we can improve; it doesn't seem obviously necessary
to load all the dentries in a dirfrag during this phase.

Anyway, you can hopefully recover from this state by forcibly
unmounting your clients.  Since you're using the kernel client it may
be easiest to hard reset the client boxes.  When you next restart your
MDS, the clients won't be present, so the MDS will be able to make it
all the way up without trying to load a bunch of directory fragments.
If you've got some more RAM for the MDS box that wouldn't hurt either.
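
For what it's worth, the rough sequence I'd try is below (the mount point,
mds name and service invocation are just examples, so adjust for your boxes,
and as I said a hard reset of the clients is fine if the unmount hangs):

  # on each client box: force/lazy unmount, or just hard reset the machine
  sudo umount -f -l /mnt/cephfs
  # on the MDS box: restart the daemon and watch it come up
  sudo service ceph restart mds.cephmds02
  ceph mds stat
  # once it's active, check that no stale client sessions are still listed
  sudo ceph daemon mds.cephmds02 session ls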

One of the less well tested (but relevant here) features we have is
directory fragmentation, where large dirs like these are internally
split up (partly to avoid memory management issues like this).  It
might be a risky business on a system that you've already got real
data on, but once your MDS is back up and running you can try enabling
the mds_bal_frag setting.
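
If you do give it a go, the knob lives under [mds] in ceph.conf and a
restart is the surest way to have it picked up; injectargs should also set
it on a running daemon.  Roughly (the mds name is again just an example):

  # in ceph.conf on the MDS hosts:
  #   [mds]
  #       mds bal frag = true
  # or, to flip it on the running daemon:
  sudo ceph tell mds.cephmds02 injectargs '--mds_bal_frag true'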

This is not a use case we have particularly strong coverage of in our
automated tests, so thanks for your experimentation and persistence.

John

>
> I have also gotten a log file with debug mds = 20.  It was 1.2GB, so I
> bzip2'd it with max compression and got it down to 75MB.  I wasn't sure
> where to upload it, so if there is a better place to put it, please let me
> know.
>
> https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8
>
> thanks,
> Bob
>
>
> On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>
>> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
>> > I had a dual MDS server configuration and have been copying data via the
>> > cephfs kernel module to my cluster for the past 3 weeks, and just had an
>> > MDS crash halting all IO.  Leading up to the crash, I ran a test dd that
>> > increased the throughput by about 2x and stopped it, but about 10 minutes
>> > later the MDS server crashed and did not fail over to the standby
>> > properly.  I have been using an active/standby MDS configuration, but
>> > neither of the MDS servers will stay running at this point; they crash
>> > right after starting.
>> >
>> > [bababurko@cephmon01 ~]$ sudo ceph -s
>> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>> >      health HEALTH_WARN
>> >             mds cluster is degraded
>> >             mds cephmds02 is laggy
>> >             noscrub,nodeep-scrub flag(s) set
>> >      monmap e1: 3 mons at
>> >
>> > {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>> >      osdmap e324: 30 osds: 30 up, 30 in
>> >             flags noscrub,nodeep-scrub
>> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>> >             14051 GB used, 13880 GB / 27931 GB avail
>> >                 2112 active+clean
>> >
>> >
>> > I am not sure what information is relevant, so I will try to cover what I
>> > think matters based on posts I have read through:
>> >
>> > Cluster:
>> > running ceph-0.94.1 on CentOS 7.1
>> > [root@mdstest02 bababurko]$ uname -r
>> > 3.10.0-229.el7.x86_64
>> >
>> > Here is my ceph-mds log with 'debug objecter = 10':
>> >
>> >
>> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>>
>>
>> Could you use gdb to check where the crash happened?  (gdb
>> /usr/local/bin/ceph-mds /core.xxxxx; you may need to re-compile the
>> MDS with debuginfo.)
>>
>> Yan, Zheng
>>
>> >
>> > cat /sys/kernel/debug/ceph/*/mdsc output:
>> >
>> >
>> > https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
>> >
>> > ceph.conf :
>> >
>> >
>> > https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
>> >
>> > I have copied almost 5TB of small files to this cluster, which has taken
>> > the better part of three weeks, so I am really hoping that there is a way
>> > to recover from this.  This is our POC cluster.
>> >
>> > I'm sure I have missed something relevant, as I'm just getting my mind
>> > back after nearly losing it, so feel free to ask for anything that would
>> > assist.
>> >
>> > Any help would be greatly appreciated.
>> >
>> > thanks,
>> > Bob
>> >
>> >
>> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


