Re: MDS crash, wont startup again

Greg Farnum <greg@xxxxxxxxxxx> · Mon, 4 Jun 2012 12:49:31 -0700

On Thursday, May 24, 2012 at 5:29 AM, Felix Feinhals wrote:
> Hi,
>  
> i was using the Debian Packages, but i tried now from source.
> I used the same version from GIT
> (cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
> report.
> Then i applied your patch but again the same crash, i think the
> backtrace is also the same:
>  
> (gdb) thread 1
> [Switching to thread 1 (Thread 9564)]#0 0x00007f33a3e58ebb in raise
> (sig=<value optimized out>)
> at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> 41 in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
> (gdb) backtrace
> #0 0x00007f33a3e58ebb in raise (sig=<value optimized out>)
> at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> #1 0x000000000081423e in reraise_fatal (signum=11) at
> global/signal_handler.cc:58
> #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
> #3 <signal handler called>
> #4 SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
> at mds/snap.cc:112
> #5 0x000000000055d58b in MDCache::check_realm_past_parents
> (this=0x27a7200, realm=0x0)
> at mds/MDCache.cc:4495
> #6 0x0000000000572eec in
> MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
> at mds/MDCache.cc:4533
> #7 0x00000000005931a0 in MDCache::rejoin_gather_finish
> (this=0x27a7200) at mds/MDCache.cc:4444
> #8 0x000000000059b9d5 in MDCache::rejoin_send_rejoins
> (this=0x27a7200) at mds/MDCache.cc:3388
> #9 0x00000000004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at
> mds/MDS.cc:1404
> #10 0x00000000004c253a in MDS::handle_mds_map (this=0x27bc000,
> m=<value optimized out>)
> at mds/MDS.cc:968
> #11 0x00000000004c4513 in MDS::handle_core_message (this=0x27bc000,
> m=0x27ab800) at mds/MDS.cc:1651
> #12 0x00000000004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800)
> at mds/MDS.cc:1790
> #13 0x00000000004c628b in MDS::ms_dispatch (this=0x27bc000,
> m=0x27ab800) at mds/MDS.cc:1602
> #14 0x0000000000732609 in Messenger::ms_deliver_dispatch
> (this=0x279f680) at msg/Messenger.h:178
> #15 SimpleMessenger::dispatch_entry (this=0x279f680) at
> msg/SimpleMessenger.cc:363
> #16 0x00000000007207ad in SimpleMessenger::DispatchThread::entry() ()
> #17 0x00007f33a3e508ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #18 0x00007f33a26d892d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #19 0x0000000000000000 in ?? ()
>  
> Any more ideas? :)
> Or can i get you more debugging output?

Sorry for the delay. I'm afraid that's a hazard of using the MDS before we're ready to support it. :(
Anyway, I haven't had a lot of time to look into this, but the backtrace makes it look like there's an actual problem: one of the inodes can't find the "SnapRealm" it lives in (frames #4 and #5 show a null realm pointer). Two things will make this easier to diagnose, in the event that somebody gets the time. First, generate high-level debug logs and place them somewhere accessible (start up the MDS with "debug mds = 20" added to the config file). Second, if you want, you could also try the patch below, which will cause the MDS to dump its full inode cache upon triggering this bug, and we can see if there's anything really obvious.
(This is a fine thing to file a bug report on at tracker.newdream.net, btw, and that allows attachments of things like log files.)
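
For concreteness, the logging change would look roughly like the fragment below. This is only a sketch: the file path and the placement under an [mds] section are assumptions based on a default install, so adjust them to match your setup.

```ini
# Assumed location: /etc/ceph/ceph.conf
[mds]
    # Verbose MDS logging, as suggested above
    debug mds = 20
```

The MDS needs to be started again after the change for the setting to be picked up, and the log grows quickly at this level.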
-Greg

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 143faca..6aa5923 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4527,6 +4527,11 @@ void MDCache::choose_lock_states_and_reconnect_caps()
   dout(15) << " chose lock states on " << *in << dendl;

   SnapRealm *realm = in->find_snaprealm();
+  if (!realm) {
+    dout(0) << "serious error, could not find snaprealm for in " << *in
+            << ", triggering cache dump" << dendl;
+    dump_cache();
+  }

   check_realm_past_parents(realm);


