Hello Nick,

On Fri, Nov 4, 2016 at 9:54 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> I upgraded to 10.2.3 today and after restarting the MDS, the same or very similar problem occurred. I didn't see any of the
> symlink errors, so I think that was fixed in the upgrade but I was still seeing looping and crashing until I killed all clients
> and evicted sessions.
> [...]
> 2016-11-04 13:35:11.255310 7f945df1c700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6800/8619) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:11.255330 7f945df1c700 1 mds.gp-ceph-mds1 respawn
> 2016-11-04 13:35:11.255339 7f945df1c700 1 mds.gp-ceph-mds1 e: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255346 7f945df1c700 1 mds.gp-ceph-mds1 0: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255353 7f945df1c700 1 mds.gp-ceph-mds1 1: '-f'
> 2016-11-04 13:35:11.255359 7f945df1c700 1 mds.gp-ceph-mds1 2: '--cluster'
> 2016-11-04 13:35:11.255366 7f945df1c700 1 mds.gp-ceph-mds1 3: 'ceph'
> 2016-11-04 13:35:11.255372 7f945df1c700 1 mds.gp-ceph-mds1 4: '--id'
> 2016-11-04 13:35:11.255381 7f945df1c700 1 mds.gp-ceph-mds1 5: 'gp-ceph-mds1'
> 2016-11-04 13:35:11.255388 7f945df1c700 1 mds.gp-ceph-mds1 6: '--setuser'
> 2016-11-04 13:35:11.255395 7f945df1c700 1 mds.gp-ceph-mds1 7: 'ceph'
> 2016-11-04 13:35:11.255401 7f945df1c700 1 mds.gp-ceph-mds1 8: '--setgroup'
> 2016-11-04 13:35:11.255407 7f945df1c700 1 mds.gp-ceph-mds1 9: 'ceph'
> 2016-11-04 13:35:11.255437 7f945df1c700 1 mds.gp-ceph-mds1 exe_path /usr/bin/ceph-mds
> 2016-11-04 13:35:11.259542 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:11.259549 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:boot --> up:replay
> 2016-11-04 13:35:11.259563 7fcf47754700 1 mds.0.1371026 replay_start
> 2016-11-04 13:35:11.259565 7fcf47754700 1 mds.0.1371026 recovery set is
> 2016-11-04 13:35:11.259569 7fcf47754700 1 mds.0.1371026 waiting for osdmap 1382368 (which blacklists prior instance)
> 2016-11-04 13:35:11.266088 7fcf42447700 0 mds.0.cache creating system inode with ino:100
> 2016-11-04 13:35:11.266207 7fcf42447700 0 mds.0.cache creating system inode with ino:1
> 2016-11-04 13:35:11.270580 7fe46b86c200 0 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process ceph-mds, pid 8619
> 2016-11-04 13:35:11.271530 7fe46b86c200 0 pidfile_write: ignore empty --pid-file
> 2016-11-04 13:35:11.507023 7fcf40637700 1 mds.0.1371026 replay_done
> 2016-11-04 13:35:11.507077 7fcf40637700 1 mds.0.1371026 making mds journal writeable
> 2016-11-04 13:35:12.366251 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:12.366260 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:replay --> up:reconnect
> 2016-11-04 13:35:12.366268 7fcf47754700 1 mds.0.1371026 reconnect_start
> 2016-11-04 13:35:12.366269 7fcf47754700 1 mds.0.1371026 reopen_log
> 2016-11-04 13:35:12.366284 7fcf47754700 1 mds.0.server reconnect_clients -- 9 sessions
> 2016-11-04 13:35:12.366317 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623964 10.1.103.231:0/2164520513 after 0.000010
> 2016-11-04 13:35:12.366476 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719692 10.1.103.233:0/174968170 after 0.000184
> 2016-11-04 13:35:12.367048 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719695 10.1.103.232:0/1211624383 after 0.000752
> 2016-11-04 13:35:12.371509 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765641 10.1.103.231:0/1622816431 after 0.005214
> 2016-11-04 13:35:12.377333 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623967 10.1.103.232:0/342252210 after 0.011038
> 2016-11-04 13:35:12.377782 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765635 10.1.103.231:0/2313865568 after 0.011485
> 2016-11-04 13:35:12.396777 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719689 10.1.103.232:0/3215599620 after 0.030460
> 2016-11-04 13:35:12.418640 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2624111 10.1.106.3:0/2929007149 after 0.052235
> 2016-11-04 13:35:15.407765 7fcf3de16700 0 -- 10.1.2.76:6805/1334 >> 10.1.103.232:0/3215599620 pipe(0x55d529479400 sd=1787 :6805 s=2 pgs=6645 cs=1 l=0 c=0x55d529954a80).fault with nothing to send, going to standby
> 2016-11-04 13:35:15.409659 7fe465752700 1 mds.gp-ceph-mds1 handle_mds_map standby
> 2016-11-04 13:35:15.412866 7fcf47754700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6805/1334) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:15.412886 7fcf47754700 1 mds.gp-ceph-mds1 respawn
>
> Some of the asserts:
>
> 2016-11-04 13:26:30.344284 7f03fd42e700 -1 mds/MDSDaemon.cc: In function 'void MDSDaemon::respawn()' thread 7f03fd42e700 time 2016-11-04 13:26:30.329841
> mds/MDSDaemon.cc: 1132: FAILED assert(0)
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557546692e80]
> 2: (MDSDaemon::respawn()+0x73d) [0x5575462789fd]
> 3: (MDSDaemon::handle_mds_map(MMDSMap*)+0x1517) [0x557546281667]
> 4: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x557546284a03]
> 5: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x557546284cf3]
> 6: (DispatchQueue::entry()+0xf2b) [0x557546799f6b]
> 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x55754667911d]
> 8: (()+0x76fa) [0x7f04026b06fa]
> 9: (clone()+0x6d) [0x7f0400b71b5d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This assert might be this issue: http://tracker.ceph.com/issues/17531

However, the exe_path debug line in your log does not point to that bug. If it did, you would see something like:

2016-10-06 15:12:04.933212 7fd94f072700 1 mds.a exe_path /home/pdonnell/ceph/build/bin/ceph-mds (deleted)

--
Patrick Donnelly
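For context on that tracker issue: the " (deleted)" suffix on exe_path is what Linux reports when the binary a running process was started from has since been unlinked or replaced on disk (for example by a package upgrade), so a respawn that tries to re-exec that path cannot find it. The sketch below is not Ceph code; it is a minimal standalone C illustration of where the suffix comes from, assuming (as the tracker log line suggests) that the daemon resolves its own path via /proc/self/exe. The binary name ./exe_demo is only for this example.

/*
 * exe_demo.c - minimal sketch (not Ceph source) of the "(deleted)" exe_path.
 *
 * /proc/self/exe is a symlink to the file this process was exec'd from.
 * If that file is removed or replaced while the process keeps running,
 * readlink() returns the old path followed by " (deleted)", and any
 * attempt to re-exec that path will fail.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char exe_path[4096];

    /* Pause so the binary can be removed from another shell,
     * e.g.: rm ./exe_demo */
    sleep(30);

    ssize_t n = readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1);
    if (n < 0) {
        perror("readlink /proc/self/exe");
        return 1;
    }
    exe_path[n] = '\0';

    printf("exe_path %s\n", exe_path);

    if (strstr(exe_path, " (deleted)"))
        fprintf(stderr, "binary on disk is gone; re-exec'ing this path would fail\n");

    return 0;
}

Build it, run it, and delete the binary during the 30-second pause: the printed path then ends in " (deleted)", which is the marker shown in the tracker log line above. Since your log shows a plain exe_path with no such suffix, that particular failure mode does not appear to apply here.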