MDS HA failover

Luke Weber <luke.weber@xxxxxxxxx> · Wed, 8 Feb 2017 13:46:40 -0800

I'm experimenting with an MDS and a hot standby on Kraken. When I fail the active MDS manually (ceph mds fail <active-mds>), it switches over to the standby correctly.
However, with two MDS servers, when I shut down the active MDS server it takes about 5 minutes for the standby-replay daemon to become active (it seems to be 20 retries at a 15-second timeout against the previously active MDS). I can't fail the active MDS at that point because it has already been removed from the MDS map, yet the hot standby stays stuck in replay for 5 minutes waiting for the old active before it gives up and becomes active. Is there a preferred way to configure this behavior, or to force a failover after an unexpected failure of the active MDS?

MDS log of the standby becoming active:

2017-02-08 17:25:54.151002 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:55.153022 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:56.154928 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:57.156771 7fa0a1502700  1 mds.0.0 replay_done (as standby)
2017-02-08 17:25:58.158700 7fa0a1502700  1 mds.0.0 replay_done (as standby)
----- Shut down the active MDS (start to see it reconnecting to the active server):
2017-02-08 17:26:08.774979 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:26:23.775456 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
----- 15-second grace to get an MDS map update (mds beacon grace=15)
2017-02-08 17:26:25.003332 7fa0a650c700  1 mds.0.132 handle_mds_map i am now mds.0.132
2017-02-08 17:26:25.003340 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:standby-replay --> up:replay
2017-02-08 17:26:38.776036 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:26:53.776916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:27:08.777962 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:27:23.777884 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:27:38.778943 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:27:53.779926 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8316800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:28:08.780927 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:28:23.780909 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:28:38.781947 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:28:53.782075 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:29:08.782916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:29:23.783476 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:29:38.784445 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:29:53.784934 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:30:08.785959 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d3800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:30:23.786921 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b82d2000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:30:38.786923 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad6800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:30:53.788035 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0baad5000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:31:08.788730 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(0x7fa0b8315000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-02-08 17:31:15.393349 7fa0a1502700  1 mds.0.132 replay_done (as standby)
2017-02-08 17:31:15.393353 7fa0a1502700  1 mds.0.132 standby_replay_restart (final takeover pass)
2017-02-08 17:31:15.397825 7fa0a1502700  1 mds.0.132 replay_done
2017-02-08 17:31:15.397832 7fa0a1502700  1 mds.0.132 making mds journal writeable
2017-02-08 17:31:16.163297 7fa0a650c700  1 mds.0.132 handle_mds_map i am now mds.0.132
2017-02-08 17:31:16.163303 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:replay --> up:reconnect
2017-02-08 17:31:16.163312 7fa0a650c700  1 mds.0.132 reconnect_start
2017-02-08 17:31:16.163314 7fa0a650c700  1 mds.0.132 reopen_log
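
To put numbers on the delay, here is a rough Python sketch that measures how long the standby sat in up:replay and how far apart the connection faults were. It is only an illustration: the function names (timestamp, summarize) are made up for this example, the marker strings are simply the ones that appear in the log above, and the sample holds four lines copied from that log with the conn(...) details trimmed.

```python
from datetime import datetime

# Four lines copied from the log above; conn(...) details trimmed for brevity.
SAMPLE = """\
2017-02-08 17:26:25.003340 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:standby-replay --> up:replay
2017-02-08 17:26:38.776036 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(...).fault
2017-02-08 17:26:53.776916 7fa0a9483700  0 -- 172.20.1.139:6800/255206595 >> - conn(...).fault
2017-02-08 17:31:16.163303 7fa0a650c700  1 mds.0.132 handle_mds_map state change up:replay --> up:reconnect
"""


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' stamp of a log line."""
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def summarize(log_text):
    """Return (seconds spent in up:replay, list of gaps between fault lines)."""
    replay_start = None
    reconnect_start = None
    faults = []
    for line in log_text.splitlines():
        if "state change up:standby-replay --> up:replay" in line:
            replay_start = timestamp(line)
        elif "state change up:replay --> up:reconnect" in line:
            reconnect_start = timestamp(line)
        elif ".fault" in line:
            faults.append(timestamp(line))
    # Seconds between each consecutive pair of connection faults.
    gaps = [(b - a).total_seconds() for a, b in zip(faults, faults[1:])]
    stuck = None
    if replay_start is not None and reconnect_start is not None:
        stuck = (reconnect_start - replay_start).total_seconds()
    return stuck, gaps


stuck, gaps = summarize(SAMPLE)
print(f"time in up:replay: {stuck:.1f} s")
print("gaps between connection faults:", [round(g, 1) for g in gaps])
```

On the sample this reports about 291 seconds in up:replay and a gap of about 15 seconds between faults, which is in line with the roughly 5 minutes and 15-second timeout described above. Feeding it the full MDS log instead of SAMPLE should give the per-retry spacing for the whole outage.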