On Fri, Nov 4, 2016 at 2:54 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> I upgraded to 10.2.3 today and after restarting the MDS, the same or very similar problem occurred. I didn't see any of the
> symlink errors, so I think that was fixed in the upgrade, but I was still seeing looping and crashing until I killed all
> clients and evicted sessions.
>
> Just after restarting MDS:
>
> 2016-11-04 13:26:30.373343 7f501f085700 0 mds.0.cache creating system inode with ino:100
> 2016-11-04 13:26:30.374377 7f501f085700 0 mds.0.cache creating system inode with ino:1
> 2016-11-04 13:26:31.290498 7f501d275700 1 mds.0.1370673 replay_done
> 2016-11-04 13:26:31.294583 7f501d275700 1 mds.0.1370673 making mds journal writeable
> 2016-11-04 13:26:31.779444 7f5024392700 1 mds.0.1370673 handle_mds_map i am now mds.0.1370673
> 2016-11-04 13:26:31.779470 7f5024392700 1 mds.0.1370673 handle_mds_map state change up:replay --> up:reconnect
> 2016-11-04 13:26:31.779477 7f5024392700 1 mds.0.1370673 reconnect_start
> 2016-11-04 13:26:31.779479 7f5024392700 1 mds.0.1370673 reopen_log
> 2016-11-04 13:26:31.779485 7f5024392700 1 mds.0.server reconnect_clients -- 11 sessions
> 2016-11-04 13:26:31.785040 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2623964 10.1.103.231:0/2164520513 after 0.005521
> 2016-11-04 13:26:31.785218 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2765641 10.1.103.231:0/1622816431 after 0.005723
> 2016-11-04 13:26:31.786122 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2623967 10.1.103.232:0/342252210 after 0.006627
> 2016-11-04 13:26:31.786713 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2719695 10.1.103.232:0/1211624383 after 0.007219
> 2016-11-04 13:26:31.798640 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2719692 10.1.103.233:0/174968170 after 0.019143
> 2016-11-04 13:26:31.800473 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2719698 10.1.103.233:0/2257480425 after 0.020906
> 2016-11-04 13:26:31.801293 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2765635 10.1.103.231:0/2313865568 after 0.021787
> 2016-11-04 13:26:31.886733 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2719689 10.1.103.232:0/3215599620 after 0.107140
> 2016-11-04 13:26:31.938048 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2765770 10.1.106.7:0/789497307 after 0.157027
> 2016-11-04 13:26:31.962209 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2624111 10.1.106.3:0/2929007149 after 0.182597
> 2016-11-04 13:26:31.984629 7f5024392700 0 log_channel(cluster) log [DBG] : reconnect by client.2765638 10.1.103.233:0/2247183319 after 0.205018
> 2016-11-04 13:26:32.029738 7f5024392700 1 mds.0.1370673 reconnect_done
> 2016-11-04 13:26:32.785680 7f5024392700 1 mds.0.1370673 handle_mds_map i am now mds.0.1370673
> 2016-11-04 13:26:32.785715 7f5024392700 1 mds.0.1370673 handle_mds_map state change up:reconnect --> up:rejoin
> 2016-11-04 13:26:32.785731 7f5024392700 1 mds.0.1370673 rejoin_start
> 2016-11-04 13:26:38.572270 7f5024392700 1 mds.0.1370673 rejoin_joint_start
> 2016-11-04 13:26:43.971915 7f5020289700 1 mds.0.1370673 rejoin_done
> 2016-11-04 13:26:44.375857 7f5024392700 1 mds.0.1370673 handle_mds_map i am now mds.0.1370673
> 2016-11-04 13:26:44.375871 7f5024392700 1 mds.0.1370673 handle_mds_map state change up:rejoin --> up:active
> 2016-11-04 13:26:44.375878 7f5024392700 1 mds.0.1370673 recovery_done -- successful recovery!
> 2016-11-04 13:26:44.375949 7f5024392700 1 mds.0.1370673 active_start
> 2016-11-04 13:26:44.519621 7f5024392700 1 mds.0.1370673 cluster recovered.

The question here is what's sending the SIGTERM? It could be the OOM killer, but given that your system made it to active in
just a few seconds, it seems unlikely that it could have eaten that much RAM that quickly, unless there was something else on
the system consuming the available memory?

> 2016-11-04 13:27:03.651175 7f5020a8a700 -1 mds.gp-ceph-mds1 *** got signal Terminated ***
> 2016-11-04 13:27:03.651191 7f5020a8a700 1 mds.gp-ceph-mds1 suicide. wanted state up:active
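If you want to pin down who sent it, the kernel log will say explicitly whether the OOM killer fired, and on a systemd host the journal records what happened to the unit. A quick check along these lines should settle it (the unit name here is a guess based on your daemon id, so adjust to your setup):

    # Any OOM kills around the time of the suicide?
    dmesg -T | grep -iE 'out of memory|oom'

    # What happened to the MDS unit around 13:27?
    journalctl -u ceph-mds@gp-ceph-mds1 --since "2016-11-04 13:20" --until "2016-11-04 13:30"

If dmesg is quiet, an ordinary stop/restart (or something on the hypervisor side) is more likely than memory pressure.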
> A section whilst it's looping:
>
> 2016-11-04 13:35:11.255310 7f945df1c700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6800/8619) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:11.255330 7f945df1c700 1 mds.gp-ceph-mds1 respawn
> 2016-11-04 13:35:11.255339 7f945df1c700 1 mds.gp-ceph-mds1  e: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255346 7f945df1c700 1 mds.gp-ceph-mds1  0: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255353 7f945df1c700 1 mds.gp-ceph-mds1  1: '-f'
> 2016-11-04 13:35:11.255359 7f945df1c700 1 mds.gp-ceph-mds1  2: '--cluster'
> 2016-11-04 13:35:11.255366 7f945df1c700 1 mds.gp-ceph-mds1  3: 'ceph'
> 2016-11-04 13:35:11.255372 7f945df1c700 1 mds.gp-ceph-mds1  4: '--id'
> 2016-11-04 13:35:11.255381 7f945df1c700 1 mds.gp-ceph-mds1  5: 'gp-ceph-mds1'
> 2016-11-04 13:35:11.255388 7f945df1c700 1 mds.gp-ceph-mds1  6: '--setuser'
> 2016-11-04 13:35:11.255395 7f945df1c700 1 mds.gp-ceph-mds1  7: 'ceph'
> 2016-11-04 13:35:11.255401 7f945df1c700 1 mds.gp-ceph-mds1  8: '--setgroup'
> 2016-11-04 13:35:11.255407 7f945df1c700 1 mds.gp-ceph-mds1  9: 'ceph'
> 2016-11-04 13:35:11.255437 7f945df1c700 1 mds.gp-ceph-mds1 exe_path /usr/bin/ceph-mds
> 2016-11-04 13:35:11.259542 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:11.259549 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:boot --> up:replay
> 2016-11-04 13:35:11.259563 7fcf47754700 1 mds.0.1371026 replay_start
> 2016-11-04 13:35:11.259565 7fcf47754700 1 mds.0.1371026  recovery set is
> 2016-11-04 13:35:11.259569 7fcf47754700 1 mds.0.1371026  waiting for osdmap 1382368 (which blacklists prior instance)
> 2016-11-04 13:35:11.266088 7fcf42447700 0 mds.0.cache creating system inode with ino:100
> 2016-11-04 13:35:11.266207 7fcf42447700 0 mds.0.cache creating system inode with ino:1
> 2016-11-04 13:35:11.270580 7fe46b86c200 0 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process ceph-mds, pid 8619
> 2016-11-04 13:35:11.271530 7fe46b86c200 0 pidfile_write: ignore empty --pid-file
> 2016-11-04 13:35:11.507023 7fcf40637700 1 mds.0.1371026 replay_done
> 2016-11-04 13:35:11.507077 7fcf40637700 1 mds.0.1371026 making mds journal writeable
> 2016-11-04 13:35:12.366251 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:12.366260 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:replay --> up:reconnect
> 2016-11-04 13:35:12.366268 7fcf47754700 1 mds.0.1371026 reconnect_start
> 2016-11-04 13:35:12.366269 7fcf47754700 1 mds.0.1371026 reopen_log
> 2016-11-04 13:35:12.366284 7fcf47754700 1 mds.0.server reconnect_clients -- 9 sessions
> 2016-11-04 13:35:12.366317 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623964 10.1.103.231:0/2164520513 after 0.000010
> 2016-11-04 13:35:12.366476 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719692 10.1.103.233:0/174968170 after 0.000184
> 2016-11-04 13:35:12.367048 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719695 10.1.103.232:0/1211624383 after 0.000752
> 2016-11-04 13:35:12.371509 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765641 10.1.103.231:0/1622816431 after 0.005214
> 2016-11-04 13:35:12.377333 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623967 10.1.103.232:0/342252210 after 0.011038
> 2016-11-04 13:35:12.377782 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765635 10.1.103.231:0/2313865568 after 0.011485
> 2016-11-04 13:35:12.396777 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719689 10.1.103.232:0/3215599620 after 0.030460
> 2016-11-04 13:35:12.418640 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2624111 10.1.106.3:0/2929007149 after 0.052235
> 2016-11-04 13:35:15.407765 7fcf3de16700 0 -- 10.1.2.76:6805/1334 >> 10.1.103.232:0/3215599620 pipe(0x55d529479400 sd=1787 :6805 s=2 pgs=6645 cs=1 l=0 c=0x55d529954a80).fault with nothing to send, going to standby
> 2016-11-04 13:35:15.409659 7fe465752700 1 mds.gp-ceph-mds1 handle_mds_map standby
> 2016-11-04 13:35:15.412866 7fcf47754700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6805/1334) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:15.412886 7fcf47754700 1 mds.gp-ceph-mds1 respawn
>
> Some of the asserts:
>
> 2016-11-04 13:26:30.344284 7f03fd42e700 -1 mds/MDSDaemon.cc: In function 'void MDSDaemon::respawn()' thread 7f03fd42e700 time 2016-11-04 13:26:30.329841
> mds/MDSDaemon.cc: 1132: FAILED assert(0)

This is the assert you hit if the MDS is trying to respawn itself but it can't. A slightly earlier log message might say why.

>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557546692e80]
>  2: (MDSDaemon::respawn()+0x73d) [0x5575462789fd]
>  3: (MDSDaemon::handle_mds_map(MMDSMap*)+0x1517) [0x557546281667]
>  4: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x557546284a03]
>  5: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x557546284cf3]
>  6: (DispatchQueue::entry()+0xf2b) [0x557546799f6b]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x55754667911d]
>  8: (()+0x76fa) [0x7f04026b06fa]
>  9: (clone()+0x6d) [0x7f0400b71b5d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
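On the respawn failure: respawn() re-execs the original binary at the logged exe_path and ends up in that assert when the exec fails. One common way to get there right after an upgrade is the running daemon still pointing at a binary that the package manager has since replaced or deleted. A quick sanity check, assuming a single ceph-mds process on the box:

    # Does the running MDS still map to a live binary, or one marked '(deleted)'?
    ls -l /proc/$(pidof ceph-mds)/exe

If the symlink target shows '(deleted)', a full daemon restart rather than an in-place respawn is the way out.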
> 2016-11-04 13:36:03.805549 7f924ded0700 1 mds.gp-ceph-mds1 asok_command: session evict (starting...)
> 2016-11-04 13:36:03.809513 7f924ded0700 -1 mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSInternalContextBase*)' thread 7f924ded0700 time 2016-11-04 13:36:03.805586
> mds/MDLog.cc: 275: FAILED assert(!mds->is_any_replay())

This is a bug, but presumably not the root of your problems (http://tracker.ceph.com/issues/17801).

>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55c691e22ef0]
>  2: (MDLog::_submit_entry(LogEvent*, MDSInternalContextBase*)+0x47f) [0x55c691c8189f]
>  3: (Server::journal_close_session(Session*, int, Context*)+0x6df) [0x55c691a6059f]
>  4: (Server::kill_session(Session*, Context*)+0x1bd) [0x55c691a623bd]
>  5: (MDSRankDispatcher::handle_asok_command(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&, ceph::Formatter*, std::ostream&)+0x491) [0x55c691a31df1]
>  6: (MDSDaemon::asok_command(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::ostream&)+0x222) [0x55c691a00152]
>  7: (MDSSocketHook::call(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list&)+0xa8) [0x55c691a10db8]
>  8: (AdminSocket::do_accept()+0x1267) [0x55c691e1c987]
>  9: (AdminSocket::entry()+0x298) [0x55c691e1df48]
>  10: (()+0x770a) [0x7f925215070a]
>  11: (clone()+0x6d) [0x7f925061182d]
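The backtrace shows why it fires: the evict tries to journal the session close while the MDS is still in replay. So until that's fixed, evicting through the admin socket is only safe once the rank is actually active. If you need to cut a client off while the MDS is still replaying, you can do it from outside the MDS entirely by blacklisting the client address at the OSD level. Roughly like this; the address is just one of the clients from your reconnect lines, by way of example:

    # Find the client's entity address in the session list, then:
    ceph osd blacklist add 10.1.103.232:0/3215599620

    # Confirm, and remove later once the client has been rebooted/remounted
    ceph osd blacklist ls
    ceph osd blacklist rm 10.1.103.232:0/3215599620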
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>> Sent: 02 November 2016 21:50
>> To: 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: MDS Problems - Solved but reporting for benefit of others
>>
>> A bit more digging: the original crash appears to be similar (but not exactly the same) as this tracker report:
>>
>> http://tracker.ceph.com/issues/16983
>>
>> I can see that this was fixed in 10.2.3, so I will probably look to upgrade.
>>
>> If the logs make sense to anybody with a bit more knowledge, I would be interested to know whether that bug is related or whether I have stumbled on something new.
>>
>> Nick
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Nick Fisk
>> > Sent: 02 November 2016 17:58
>> > To: 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
>> > Subject: MDS Problems - Solved but reporting for benefit of others
>> >
>> > Hi all,
>> >
>> > Just a bit of an outage with CephFS around the MDSs. I managed to get everything up and running again after a bit of head
>> > scratching and thought I would share here what happened.
>> >
>> > Cause
>> > I believe the MDSs, which were running as VMs, suffered when the hypervisor ran out of RAM and started swapping during
>> > hypervisor maintenance. I know this is less than ideal and have put steps in place to prevent this happening again.
>> >
>> > Symptoms
>> > 1. Noticed that both MDSs were down; log files on both showed that they had crashed
>> > 2. After restarting the MDSs, their status kept flipping between replay and reconnect
>> > 3. Now and again both MDSs would crash again
>> > 4. Log files showed they seemed to keep restarting after trying to reconnect clients
>> > 5. Clients were all kernel clients; one was running 3.19 and the rest 4.8. I believe the problematic client was one of the
>> > ones running kernel 4.8
>> > 6. Ceph is 10.2.2
>> >
>> > Resolution
>> > After some serious head scratching and a little bit of panicking, the fact that the log files showed the restart always
>> > happened after trying to reconnect the clients gave me the idea to try and kill the sessions on the MDS. I first reset all
>> > the clients and waited, but this didn't seem to have any effect and I could still see the MDS trying to reconnect to the
>> > clients. I then decided to try and kill the sessions from the MDS end, so I shut down the standby MDS (as they kept
>> > flipping active roles) and ran
>> >
>> > ceph daemon mds.gp-ceph-mds1 session ls
>> >
>> > I then tried to kill the last session in the list:
>> >
>> > ceph daemon mds.gp-ceph-mds1 session evict <session id>
>> >
>> > I had to keep hammering this command to get it at the right point, as the MDS was only responding for a fraction of a second.
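An aside on the hammering: when the asok is only intermittently responsive like that, a small retry loop saves doing it by hand. Untested sketch; SESSION_ID is a placeholder for an id taken from `session ls`:

    # Keep retrying the evict until the admin socket accepts it
    SESSION_ID=12345   # placeholder: substitute an id from 'session ls'
    until ceph daemon mds.gp-ceph-mds1 session evict "$SESSION_ID"; do
        sleep 0.2
    done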
>> > Suddenly, in my other window where I had a tail of the MDS log, I saw a whizz of new information, which then stopped with
>> > the MDS success message. So it seems something the MDS was trying to do whilst reconnecting was upsetting it. ceph -s
>> > updated to show the MDS was now active. Rebooting the other MDS then made it a standby again. Problem solved.
>> >
>> > I have uploaded the 2 MDS logs here if any CephFS devs are interested in taking a closer look.
>> >
>> > Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com