This is odd on several levels, and indeed a failover shouldn't take
that long (unless you have a *lot* of metadata that needs to get loaded
into memory, which you won't if running standby-replay).

Are you sure that it's trying to connect to the other MDS, and not a
monitor or OSD on the same host? If not, can you turn on debugging and
reproduce? ("debug mds = 20", "debug ms = 1")
-Greg

On Wed, Feb 8, 2017 at 1:46 PM, Luke Weber <luke.weber@xxxxxxxxx> wrote:
> Playing around with mds with a hot standby on kraken. When I fail out the
> active mds manually it switches correctly to the standby, i.e. ceph mds fail
> <active-mds>
>
> Noticed that when I have two mds servers and I shut down the active mds
> server it takes 5 minutes for the standby replay to become active (seems
> it's 20 retries at a 15 second timeout to the previously active mds). I
> can't fail the active mds though, as it's already been removed from the mds
> map, but the hot standby is stuck in replay mode for 5 minutes waiting for
> the active before it gives up and becomes active. Curious if there's a
> preferred way to configure this behavior or force a failover in the event
> of unexpected active failure.
>
> MDS log of standby becoming master:
>
> 2017-02-08 17:25:54.151002 7fa0a1502700 1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:55.153022 7fa0a1502700 1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:56.154928 7fa0a1502700 1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:57.156771 7fa0a1502700 1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:58.158700 7fa0a1502700 1 mds.0.0 replay_done (as standby)
> ----- Shutdown active mds (Start to see it reconnecting to active server):
> 2017-02-08 17:26:08.774979 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:26:23.775456 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> ----- 15 Second grace to get an mds map update (mds beacon grace=15)
> 2017-02-08 17:26:25.003332 7fa0a650c700 1 mds.0.132 handle_mds_map i am now mds.0.132
> 2017-02-08 17:26:25.003340 7fa0a650c700 1 mds.0.132 handle_mds_map state change up:standby-replay --> up:replay
> 2017-02-08 17:26:38.776036 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:26:53.776916 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:27:08.777962 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:27:23.777884 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:27:38.778943 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:27:53.779926 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8316800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:28:08.780927 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:28:23.780909 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:28:38.781947 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:28:53.782075 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:29:08.782916 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:29:23.783476 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:29:38.784445 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:29:53.784934 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:30:08.785959 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:30:23.786921 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:30:38.786923 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:30:53.788035 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:31:08.788730 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
> 2017-02-08 17:31:15.393349 7fa0a1502700 1 mds.0.132 replay_done (as standby)
> 2017-02-08 17:31:15.393353 7fa0a1502700 1 mds.0.132 standby_replay_restart (final takeover pass)
> 2017-02-08 17:31:15.397825 7fa0a1502700 1 mds.0.132 replay_done
> 2017-02-08 17:31:15.397832 7fa0a1502700 1 mds.0.132 making mds journal writeable
> 2017-02-08 17:31:16.163297 7fa0a650c700 1 mds.0.132 handle_mds_map i am now mds.0.132
> 2017-02-08 17:31:16.163303 7fa0a650c700 1 mds.0.132 handle_mds_map state change up:replay --> up:reconnect
> 2017-02-08 17:31:16.163312 7fa0a650c700 1 mds.0.132 reconnect_start
> 2017-02-08 17:31:16.163314 7fa0a650c700 1 mds.0.132 reopen_log
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
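
For anyone wanting to check the "20 retries at 15 seconds" reading of the
quoted log, here is a rough Python sketch that measures the gap between
consecutive messenger ".fault" lines and the time spent in up:replay. It is
only an illustration: the function names (fault_intervals, state_changes,
seconds_in_state) are made up for this example and are not part of Ceph, the
LOG excerpt is a handful of lines copied from the quoted message with the
mail wrapping removed, and the script assumes one log entry per line.

```python
import re
from datetime import datetime

# Timestamp format used at the start of each MDS log line in the quoted message.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})")
# Matches "state change up:standby-replay --> up:replay" style lines.
STATE_RE = re.compile(r"state change (\S+) --> (\S+)")

FAULT_TAIL = (" 7fa0a9483700 0 -- 172.20.1.139:6800/255206595 >> - "
              "conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR "
              "pgs=0 cs=0 l=0).fault with nothing to send and in the half "
              "accept state just closed")

# Shortened excerpt of the quoted log (hypothetical helper data for this sketch;
# the connection address is the same on every fault line purely for brevity).
LOG = [
    "2017-02-08 17:26:08.774979" + FAULT_TAIL,
    "2017-02-08 17:26:23.775456" + FAULT_TAIL,
    "2017-02-08 17:26:25.003340 7fa0a650c700 1 mds.0.132 handle_mds_map "
    "state change up:standby-replay --> up:replay",
    "2017-02-08 17:26:38.776036" + FAULT_TAIL,
    "2017-02-08 17:31:16.163303 7fa0a650c700 1 mds.0.132 handle_mds_map "
    "state change up:replay --> up:reconnect",
]


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS_RE.search(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None


def fault_intervals(lines):
    """Seconds between consecutive messenger '.fault' lines."""
    stamps = [parse_ts(l) for l in lines if ").fault" in l]
    stamps = [s for s in stamps if s is not None]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


def state_changes(lines):
    """List of (timestamp, old_state, new_state) for handle_mds_map changes."""
    changes = []
    for line in lines:
        m = STATE_RE.search(line)
        ts = parse_ts(line)
        if m and ts:
            changes.append((ts, m.group(1), m.group(2)))
    return changes


def seconds_in_state(changes, state):
    """Seconds between entering `state` and leaving it again."""
    entered = next(ts for ts, _, new in changes if new == state)
    left = next(ts for ts, old, _ in changes if old == state)
    return (left - entered).total_seconds()


if __name__ == "__main__":
    print("fault gaps (s):", [round(x, 3) for x in fault_intervals(LOG)])
    print("time in up:replay (s):",
          round(seconds_in_state(state_changes(LOG), "up:replay"), 2))
```

Run against the full quoted log rather than this excerpt, the same functions
should show whether the fault lines really are evenly spaced and how long the
daemon sat in up:replay before moving to up:reconnect.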