On Wed, Aug 12, 2015 at 5:53 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> For the record: I've created issue #12671 to improve our memory
> management in this type of situation.
>
> John
>
> http://tracker.ceph.com/issues/12671

This situation has been improved in recent clients: they trim their
cache first, then send the cap reconnect to the MDS.

>
> On Tue, Aug 11, 2015 at 10:25 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
>>> Here is the backtrace from the core dump.
>>>
>>> (gdb) bt
>>> #0  0x00007f71f5404ffb in raise () from /lib64/libpthread.so.0
>>> #1  0x000000000087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59
>>> #2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
>>> #3  <signal handler called>
>>> #4  0x00007f71f40235d7 in raise () from /lib64/libc.so.6
>>> #5  0x00007f71f4024cc8 in abort () from /lib64/libc.so.6
>>> #6  0x00007f71f49279b5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
>>> #7  0x00007f71f4925926 in ?? () from /lib64/libstdc++.so.6
>>> #8  0x00007f71f4925953 in std::terminate() () from /lib64/libstdc++.so.6
>>> #9  0x00007f71f4925b73 in __cxa_throw () from /lib64/libstdc++.so.6
>>> #10 0x000000000077d0fc in operator new (num_bytes=2408) at mds/CInode.h:120
>>> Python Exception <type 'exceptions.IndexError'> list index out of range:
>>> #11 CDir::_omap_fetched (this=0x90af04f8, hdrbl=..., omap=std::map with 65536 elements, want_dn="", r=<optimized out>) at mds/CDir.cc:1700
>>> #12 0x00000000007d7d44 in complete (r=0, this=0x502b000) at include/Context.h:65
>>> #13 MDSIOContextBase::complete (this=0x502b000, r=0) at mds/MDSContext.cc:59
>>> #14 0x0000000000894818 in Finisher::finisher_thread_entry (this=0x5108698) at common/Finisher.cc:59
>>> #15 0x00007f71f53fddf5 in start_thread () from /lib64/libpthread.so.0
>>> #16 0x00007f71f40e41ad in clone () from /lib64/libc.so.6
>>
>> If we believe the line numbers here, then it's a malloc failure. Are
>> you running out of memory?
>>
>> The MDS is loading a bunch of these 64k-file directories (presumably a
>> characteristic of your workload), and ending up with an unusually
>> large number of inodes in cache (this is all happening during the
>> "rejoin" phase, so no trimming of the cache is done and we merrily
>> exceed the default mds_cache_size limit of 100k inodes).
>>
>> The thing triggering the load of the dirs is clients replaying
>> requests that refer to inodes by inode number, and the MDS's
>> procedure for handling that involves fully loading the relevant
>> dirs. That might be something we can improve; it doesn't seem
>> obviously necessary to load all the dentries in a dirfrag during
>> this phase.
>>
>> Anyway, you can hopefully recover from this state by forcibly
>> unmounting your clients. Since you're using the kernel client, it may
>> be easiest to hard reset the client boxes. When you next restart your
>> MDS, the clients won't be present, so the MDS will be able to make it
>> all the way up without trying to load a bunch of directory fragments.
>> If you've got some more RAM for the MDS box, that wouldn't hurt either.
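
For reference, a quick way to see whether the MDS really is running out
of memory while it tries to rejoin is to watch the process and the
daemon's inode counters. This is only a rough sketch: it assumes the
admin socket is in its default location, that the daemon name matches
the "cephmds02" shown in the ceph -s output quoted further down, and the
grep is just there to narrow the output.

    # free memory left on the MDS box
    free -m
    # resident size of the running ceph-mds process
    ps -o pid,rss,vsz,cmd -p $(pidof ceph-mds)
    # inode counters from the daemon's admin socket
    ceph daemon mds.cephmds02 perf dump | grep -i inode

If the resident size keeps climbing toward the box's physical RAM during
rejoin, that matches the malloc failure in the backtrace above.
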
>> One of the less well tested (but relevant here) features we have is
>> directory fragmentation, where large dirs like these are internally
>> split up (partly to avoid memory management issues like this). It
>> might be a risky business on a system that you've already got real
>> data on, but once your MDS is back up and running you can try
>> enabling the mds_bal_frag setting.
>>
>> This is not a use case we have particularly strong coverage of in our
>> automated tests, so thanks for your experimentation and persistence.
>>
>> John
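
To follow up on the mds_bal_frag suggestion: the sketch below is the
minimal ceph.conf change, assuming you only flip it once the MDS is
stable again (per John's caveat about clusters that already hold real
data) and restart the MDS afterwards.

    [mds]
        # allow the MDS to split large directories into fragments
        mds bal frag = true
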
>>
>>> I have also gotten a log file with 'debug mds = 20'. It was 1.2 GB,
>>> so I bzip2'd it with max compression and got it down to 75 MB. I
>>> wasn't sure where to upload it, so if there is a better place to put
>>> it, please let me know.
>>>
>>> https://mega.nz/#!5V4z3A7K!0METjVs5t3DAQAts8_TYXWrLh2FhGHcb7oC4uuhr2T8
>>>
>>> thanks,
>>> Bob
>>>
>>>
>>> On Mon, Aug 10, 2015 at 8:05 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>>
>>>> On Tue, Aug 11, 2015 at 9:21 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
>>>> > I had a dual MDS server configuration and have been copying data to
>>>> > my cluster via the cephfs kernel module for the past 3 weeks, and I
>>>> > just had an MDS crash that halted all IO. Leading up to the crash,
>>>> > I ran a test dd that increased the throughput by about 2x and then
>>>> > stopped it, but about 10 minutes later the MDS server crashed and
>>>> > did not fail over to the standby properly. I am using an
>>>> > active/standby MDS configuration, but at this point neither of the
>>>> > MDS servers will stay running; they crash shortly after starting.
>>>> >
>>>> > [bababurko@cephmon01 ~]$ sudo ceph -s
>>>> >     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>>>> >      health HEALTH_WARN
>>>> >             mds cluster is degraded
>>>> >             mds cephmds02 is laggy
>>>> >             noscrub,nodeep-scrub flag(s) set
>>>> >      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>>>> >             election epoch 4, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>>>> >      mdsmap e2760: 1/1/1 up {0=cephmds02=up:rejoin(laggy or crashed)}
>>>> >      osdmap e324: 30 osds: 30 up, 30 in
>>>> >             flags noscrub,nodeep-scrub
>>>> >       pgmap v1555346: 2112 pgs, 3 pools, 4993 GB data, 246 Mobjects
>>>> >             14051 GB used, 13880 GB / 27931 GB avail
>>>> >                 2112 active+clean
>>>> >
>>>> > I am not sure what information is relevant, so I will try to cover
>>>> > what I think is relevant based on posts I have read through:
>>>> >
>>>> > Cluster:
>>>> > running ceph-0.94.1 on CentOS 7.1
>>>> > [root@mdstest02 bababurko]$ uname -r
>>>> > 3.10.0-229.el7.x86_64
>>>> >
>>>> > Here is my ceph-mds log with 'debug objecter = 10':
>>>> >
>>>> > https://www.zerobin.net/?179a6789dfc9eb86#AHAS3YEkpHTj6CSQg8u4hk+jHBasejQNLDc9/KYkYVQ=
>>>>
>>>> Could you use gdb to check where the crash happened?  (gdb
>>>> /usr/local/bin/ceph-mds /core.xxxxx; you may need to re-compile the
>>>> mds with debuginfo.)
>>>>
>>>> Yan, Zheng
>>>>
>>>> >
>>>> > cat /sys/kernel/debug/ceph/*/mdsc output:
>>>> >
>>>> > https://www.zerobin.net/?ed238ce77b20583d#CK7Yt6yC1VgHfDee7y/CGkFh5bfyLkhwZB6i5R6N/8g=
>>>> >
>>>> > ceph.conf:
>>>> >
>>>> > https://www.zerobin.net/?62a125349aa43c92#5VH3XRR4P7zjhBHNWmTHrFYmwE0TZEig6r2EU6X1q/U=
>>>> >
>>>> > I have copied almost 5 TB of small files to this cluster, which has
>>>> > taken the better part of three weeks, so I am really hoping that
>>>> > there is a way to recover from this. This is our POC cluster.
>>>> >
>>>> > I'm sure I have missed something relevant, as I'm just getting my
>>>> > mind back after nearly losing it, so feel free to ask for anything
>>>> > to assist.
>>>> >
>>>> > Any help would be greatly appreciated.
>>>> >
>>>> > thanks,
>>>> > Bob

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com