Re: MDS HA failover

Gregory Farnum <gfarnum@xxxxxxxxxx> · Fri, 10 Feb 2017 13:46:29 -0800



This is odd on several levels, and indeed a failover shouldn't take
that long (unless you have a *lot* of metadata that needs to get
loaded into memory, which you won't if running standby-replay). Are
you sure that it's trying to connect to the other MDS, and not a
monitor or OSD on the same host?
If not, can you turn on debugging and reproduce? ("debug mds = 20",
"debug ms = 1")
-Greg
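
For reference, a minimal sketch of how the two debug settings above might be applied. Placing them in the [mds] section of ceph.conf on the standby MDS host is an assumption here; adjust for your deployment, and restart the daemon for the file to take effect.

```ini
# Sketch only: placing these under [mds] is an assumption, not something
# stated in this thread. The two values are the ones requested above.
[mds]
    debug mds = 20
    debug ms = 1

# Possible runtime alternative without a restart (<id> is a placeholder
# for your MDS name):
#   ceph tell mds.<id> injectargs '--debug-mds 20 --debug-ms 1'
```

These levels are verbose, so it is probably worth reverting them once the failover has been reproduced and the log captured.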

On Wed, Feb 8, 2017 at 1:46 PM, Luke Weber <luke.weber@xxxxxxxxx> wrote:
> I'm playing around with an MDS with a hot standby on Kraken. When I fail out
> the active MDS manually (i.e. ceph mds fail <active-mds>), it switches
> correctly to the standby.
>
> I noticed that when I have two MDS servers and I shut down the active MDS
> server, it takes 5 minutes for the standby-replay daemon to become active (it
> seems to be 20 retries at a 15-second timeout to the previously active MDS).
> I can't fail the active MDS, though, as it has already been removed from the
> MDS map, but the hot standby is stuck in replay mode for 5 minutes waiting
> for the active before it gives up and becomes active. I'm curious whether
> there's a preferred way to configure this behavior or to force a failover in
> the event of an unexpected active failure.
>
> MDS log of standby becoming master:
>
> 2017-02-08 17:25:54.151002 7fa0a1502700  1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:55.153022 7fa0a1502700  1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:56.154928 7fa0a1502700  1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:57.156771 7fa0a1502700  1 mds.0.0 replay_done (as standby)
> 2017-02-08 17:25:58.158700 7fa0a1502700  1 mds.0.0 replay_done (as standby)
> ----- Shutdown active mds (Start to see it reconnecting to active server):
> 2017-02-08 17:26:08.774979 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:26:23.775456 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> ----- 15 Second grace to get an mds map update (mds beacon grace=15)
> 2017-02-08 17:26:25.003332 7fa0a650c700  1 mds.0.132 handle_mds_map i am now
> mds.0.132
> 2017-02-08 17:26:25.003340 7fa0a650c700  1 mds.0.132 handle_mds_map state
> change up:standby-replay --> up:replay
> 2017-02-08 17:26:38.776036 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:26:53.776916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:27:08.777962 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:27:23.777884 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:27:38.778943 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:27:53.779926 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b8316800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:28:08.780927 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:28:23.780909 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:28:38.781947 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:28:53.782075 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:29:08.782916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:29:23.783476 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:29:38.784445 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:29:53.784934 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:30:08.785959 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:30:23.786921 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:30:38.786923 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:30:53.788035 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:31:08.788730 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >>
> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2017-02-08 17:31:15.393349 7fa0a1502700  1 mds.0.132 replay_done (as
> standby)
> 2017-02-08 17:31:15.393353 7fa0a1502700  1 mds.0.132 standby_replay_restart
> (final takeover pass)
> 2017-02-08 17:31:15.397825 7fa0a1502700  1 mds.0.132 replay_done
> 2017-02-08 17:31:15.397832 7fa0a1502700  1 mds.0.132 making mds journal
> writeable
> 2017-02-08 17:31:16.163297 7fa0a650c700  1 mds.0.132 handle_mds_map i am now
> mds.0.132
> 2017-02-08 17:31:16.163303 7fa0a650c700  1 mds.0.132 handle_mds_map state
> change up:replay --> up:reconnect
> 2017-02-08 17:31:16.163312 7fa0a650c700  1 mds.0.132 reconnect_start
> 2017-02-08 17:31:16.163314 7fa0a650c700  1 mds.0.132 reopen_log
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
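
To put numbers on the delay described in the quoted message, here is a small sketch that computes the intervals from timestamps copied out of the log above. The variable names and the choice of the first connection fault as a stand-in for the moment the active MDS went down are assumptions made for illustration; the log does not record the actual shutdown time.

```python
from datetime import datetime

# Timestamp layout used by the MDS log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Values copied verbatim from the quoted log (names are illustrative).
first_fault = "2017-02-08 17:26:08.774979"        # first ".fault" line
entered_replay = "2017-02-08 17:26:25.003340"     # up:standby-replay --> up:replay
last_fault = "2017-02-08 17:31:08.788730"         # last ".fault" line
entered_reconnect = "2017-02-08 17:31:16.163303"  # up:replay --> up:reconnect

# Time from the first fault until the standby was told to take over.
detect = seconds_between(first_fault, entered_replay)
# Time spent in up:replay before moving on to up:reconnect.
replay = seconds_between(entered_replay, entered_reconnect)
# Span covered by the repeating fault messages, and the implied interval count.
fault_span = seconds_between(first_fault, last_fault)
fault_intervals = round(fault_span / 15)

print(f"first fault to up:replay:   {detect:.1f} s")
print(f"up:replay to up:reconnect:  {replay:.1f} s")
print(f"fault span: {fault_span:.1f} s ({fault_intervals} intervals of ~15 s)")
```

Under these assumptions the takeover decision arrives roughly 16 seconds after the first fault, in line with the 15-second beacon grace noted in the log, while nearly all of the five minutes is spent in up:replay. The fault messages span 20 intervals of about 15 seconds, which matches the "20 retries at 15 seconds" estimate.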