Hello Nick,

On Fri, Nov 4, 2016 at 9:54 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> I upgraded to 10.2.3 today and after restarting the MDS, the same or very similar problem occurred. I didn't see any of the
> symlink errors, so I think that was fixed in the upgrade but I was still seeing looping and crashing until I killed all clients
> and evicted sessions.
> [...]
> 2016-11-04 13:35:11.255310 7f945df1c700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6800/8619) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:11.255330 7f945df1c700 1 mds.gp-ceph-mds1 respawn
> 2016-11-04 13:35:11.255339 7f945df1c700 1 mds.gp-ceph-mds1 e: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255346 7f945df1c700 1 mds.gp-ceph-mds1 0: '/usr/bin/ceph-mds'
> 2016-11-04 13:35:11.255353 7f945df1c700 1 mds.gp-ceph-mds1 1: '-f'
> 2016-11-04 13:35:11.255359 7f945df1c700 1 mds.gp-ceph-mds1 2: '--cluster'
> 2016-11-04 13:35:11.255366 7f945df1c700 1 mds.gp-ceph-mds1 3: 'ceph'
> 2016-11-04 13:35:11.255372 7f945df1c700 1 mds.gp-ceph-mds1 4: '--id'
> 2016-11-04 13:35:11.255381 7f945df1c700 1 mds.gp-ceph-mds1 5: 'gp-ceph-mds1'
> 2016-11-04 13:35:11.255388 7f945df1c700 1 mds.gp-ceph-mds1 6: '--setuser'
> 2016-11-04 13:35:11.255395 7f945df1c700 1 mds.gp-ceph-mds1 7: 'ceph'
> 2016-11-04 13:35:11.255401 7f945df1c700 1 mds.gp-ceph-mds1 8: '--setgroup'
> 2016-11-04 13:35:11.255407 7f945df1c700 1 mds.gp-ceph-mds1 9: 'ceph'
> 2016-11-04 13:35:11.255437 7f945df1c700 1 mds.gp-ceph-mds1 exe_path /usr/bin/ceph-mds
> 2016-11-04 13:35:11.259542 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:11.259549 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:boot --> up:replay
> 2016-11-04 13:35:11.259563 7fcf47754700 1 mds.0.1371026 replay_start
> 2016-11-04 13:35:11.259565 7fcf47754700 1 mds.0.1371026 recovery set is
> 2016-11-04 13:35:11.259569 7fcf47754700 1 mds.0.1371026 waiting for osdmap 1382368 (which blacklists prior instance)
> 2016-11-04 13:35:11.266088 7fcf42447700 0 mds.0.cache creating system inode with ino:100
> 2016-11-04 13:35:11.266207 7fcf42447700 0 mds.0.cache creating system inode with ino:1
> 2016-11-04 13:35:11.270580 7fe46b86c200 0 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b), process ceph-mds, pid 8619
> 2016-11-04 13:35:11.271530 7fe46b86c200 0 pidfile_write: ignore empty --pid-file
> 2016-11-04 13:35:11.507023 7fcf40637700 1 mds.0.1371026 replay_done
> 2016-11-04 13:35:11.507077 7fcf40637700 1 mds.0.1371026 making mds journal writeable
> 2016-11-04 13:35:12.366251 7fcf47754700 1 mds.0.1371026 handle_mds_map i am now mds.0.1371026
> 2016-11-04 13:35:12.366260 7fcf47754700 1 mds.0.1371026 handle_mds_map state change up:replay --> up:reconnect
> 2016-11-04 13:35:12.366268 7fcf47754700 1 mds.0.1371026 reconnect_start
> 2016-11-04 13:35:12.366269 7fcf47754700 1 mds.0.1371026 reopen_log
> 2016-11-04 13:35:12.366284 7fcf47754700 1 mds.0.server reconnect_clients -- 9 sessions
> 2016-11-04 13:35:12.366317 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623964 10.1.103.231:0/2164520513 after 0.000010
> 2016-11-04 13:35:12.366476 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719692 10.1.103.233:0/174968170 after 0.000184
> 2016-11-04 13:35:12.367048 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719695 10.1.103.232:0/1211624383 after 0.000752
> 2016-11-04 13:35:12.371509 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765641 10.1.103.231:0/1622816431 after 0.005214
> 2016-11-04 13:35:12.377333 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2623967 10.1.103.232:0/342252210 after 0.011038
> 2016-11-04 13:35:12.377782 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2765635 10.1.103.231:0/2313865568 after 0.011485
> 2016-11-04 13:35:12.396777 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2719689 10.1.103.232:0/3215599620 after 0.030460
> 2016-11-04 13:35:12.418640 7fcf47754700 0 log_channel(cluster) log [DBG] : reconnect by client.2624111 10.1.106.3:0/2929007149 after 0.052235
> 2016-11-04 13:35:15.407765 7fcf3de16700 0 -- 10.1.2.76:6805/1334 >> 10.1.103.232:0/3215599620 pipe(0x55d529479400 sd=1787 :6805 s=2 pgs=6645 cs=1 l=0 c=0x55d529954a80).fault with nothing to send, going to standby
> 2016-11-04 13:35:15.409659 7fe465752700 1 mds.gp-ceph-mds1 handle_mds_map standby
> 2016-11-04 13:35:15.412866 7fcf47754700 1 mds.gp-ceph-mds1 handle_mds_map i (10.1.2.76:6805/1334) dne in the mdsmap, respawning myself
> 2016-11-04 13:35:15.412886 7fcf47754700 1 mds.gp-ceph-mds1 respawn
>
> Some of the asserts:
>
> 2016-11-04 13:26:30.344284 7f03fd42e700 -1 mds/MDSDaemon.cc: In function 'void MDSDaemon::respawn()' thread 7f03fd42e700 time 2016-11-04 13:26:30.329841
> mds/MDSDaemon.cc: 1132: FAILED assert(0)
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557546692e80]
> 2: (MDSDaemon::respawn()+0x73d) [0x5575462789fd]
> 3: (MDSDaemon::handle_mds_map(MMDSMap*)+0x1517) [0x557546281667]
> 4: (MDSDaemon::handle_core_message(Message*)+0x7f3) [0x557546284a03]
> 5: (MDSDaemon::ms_dispatch(Message*)+0x1c3) [0x557546284cf3]
> 6: (DispatchQueue::entry()+0xf2b) [0x557546799f6b]
> 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x55754667911d]
> 8: (()+0x76fa) [0x7f04026b06fa]
> 9: (clone()+0x6d) [0x7f0400b71b5d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This assert might be this issue: http://tracker.ceph.com/issues/17531

However, the exe_path debug line in your log does not point to that bug. If it did, you would see something like:

2016-10-06 15:12:04.933212 7fd94f072700 1 mds.a exe_path /home/pdonnell/ceph/build/bin/ceph-mds (deleted)

--
Patrick Donnelly
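For context on that tracker issue: the " (deleted)" suffix on exe_path is what Linux reports when the binary a running process was started from has since been unlinked or replaced on disk (for example by a package upgrade), so a respawn that tries to re-exec that path cannot find it. The sketch below is not Ceph code; it is a minimal standalone C illustration of where the suffix comes from, assuming (as the tracker log line suggests) that the daemon resolves its own path via /proc/self/exe. The binary name ./exe_demo is only for this example.

/*
 * exe_demo.c - minimal sketch (not Ceph source) of the "(deleted)" exe_path.
 *
 * /proc/self/exe is a symlink to the file this process was exec'd from.
 * If that file is removed or replaced while the process keeps running,
 * readlink() returns the old path followed by " (deleted)", and any
 * attempt to re-exec that path will fail.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char exe_path[4096];

    /* Pause so the binary can be removed from another shell,
     * e.g.: rm ./exe_demo */
    sleep(30);

    ssize_t n = readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1);
    if (n < 0) {
        perror("readlink /proc/self/exe");
        return 1;
    }
    exe_path[n] = '\0';

    printf("exe_path %s\n", exe_path);

    if (strstr(exe_path, " (deleted)"))
        fprintf(stderr, "binary on disk is gone; re-exec'ing this path would fail\n");

    return 0;
}

Build it, run it, and delete the binary during the 30-second pause: the printed path then ends in " (deleted)", which is the marker shown in the tracker log line above. Since your log shows a plain exe_path with no such suffix, that particular failure mode does not appear to apply here.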