Please set debug_mds=10 and try again.
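For example, something like this should bump the log level on the standby (the mds.WXS0023 name below is just the one from your log, adjust it to your daemon):

    ceph daemon mds.WXS0023 config set debug_mds 10

or, if you'd rather inject it from a node with an admin keyring:

    ceph tell mds.WXS0023 injectargs '--debug_mds 10'

Putting "debug mds = 10" in the [mds] section of ceph.conf before restarting the daemon works as well. Then capture the log from the next replay attempt and send it along.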
On Tue, Apr 2, 2019 at 1:01 PM Albert Yue <transuranium.yue@xxxxxxxxx> wrote:
>
> Hi,
>
> This happens after we restart the active MDS, and somehow the standby MDS daemon cannot take over successfully and is stuck at up:replaying. It is showing the following log. Any idea on how to fix this?
>
> 2019-04-02 12:54:00.985079 7f6f70670700 1 mds.WXS0023 respawn
> 2019-04-02 12:54:00.985095 7f6f70670700 1 mds.WXS0023 e: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985097 7f6f70670700 1 mds.WXS0023 0: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985099 7f6f70670700 1 mds.WXS0023 1: '-f'
> 2019-04-02 12:54:00.985100 7f6f70670700 1 mds.WXS0023 2: '--cluster'
> 2019-04-02 12:54:00.985101 7f6f70670700 1 mds.WXS0023 3: 'ceph'
> 2019-04-02 12:54:00.985102 7f6f70670700 1 mds.WXS0023 4: '--id'
> 2019-04-02 12:54:00.985103 7f6f70670700 1 mds.WXS0023 5: 'WXS0023'
> 2019-04-02 12:54:00.985104 7f6f70670700 1 mds.WXS0023 6: '--setuser'
> 2019-04-02 12:54:00.985105 7f6f70670700 1 mds.WXS0023 7: 'ceph'
> 2019-04-02 12:54:00.985106 7f6f70670700 1 mds.WXS0023 8: '--setgroup'
> 2019-04-02 12:54:00.985107 7f6f70670700 1 mds.WXS0023 9: 'ceph'
> 2019-04-02 12:54:00.985142 7f6f70670700 1 mds.WXS0023 respawning with exe /usr/bin/ceph-mds
> 2019-04-02 12:54:00.985145 7f6f70670700 1 mds.WXS0023 exe_path /proc/self/exe
> 2019-04-02 12:54:02.139272 7ff8a739a200 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 3369045
> 2019-04-02 12:54:02.141565 7ff8a739a200 0 pidfile_write: ignore empty --pid-file
> 2019-04-02 12:54:06.675604 7ff8a0ecd700 1 mds.WXS0023 handle_mds_map standby
> 2019-04-02 12:54:26.114757 7ff8a0ecd700 1 mds.0.136021 handle_mds_map i am now mds.0.136021
> 2019-04-02 12:54:26.114764 7ff8a0ecd700 1 mds.0.136021 handle_mds_map state change up:boot --> up:replay
> 2019-04-02 12:54:26.114779 7ff8a0ecd700 1 mds.0.136021 replay_start
> 2019-04-02 12:54:26.114784 7ff8a0ecd700 1 mds.0.136021 recovery set is
> 2019-04-02 12:54:26.114789 7ff8a0ecd700 1 mds.0.136021 waiting for osdmap 14333 (which blacklists prior instance)
> 2019-04-02 12:54:26.141256 7ff89a6c0700 0 mds.0.cache creating system inode with ino:0x100
> 2019-04-02 12:54:26.141454 7ff89a6c0700 0 mds.0.cache creating system inode with ino:0x1
> 2019-04-02 12:54:50.148022 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:50.148049 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:54:52.143637 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:54.148122 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:54.148157 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:54:57.143730 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:58.148239 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:58.148249 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:02.143819 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:02.148311 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:02.148330 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:06.148393 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:06.148416 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:07.143914 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:07.615602 7ff89e6c8700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
> 2019-04-02 12:55:07.618294 7ff8a0ecd700 1 mds.WXS0023 map removed me (mds.-1 gid:7441294) from cluster due to lost contact; respawning
> 2019-04-02 12:55:07.618296 7ff8a0ecd700 1 mds.WXS0023 respawn
> 2019-04-02 12:55:07.618314 7ff8a0ecd700 1 mds.WXS0023 e: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618318 7ff8a0ecd700 1 mds.WXS0023 0: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618319 7ff8a0ecd700 1 mds.WXS0023 1: '-f'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 2: '--cluster'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 3: 'ceph'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 4: '--id'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 5: 'WXS0023'
> 2019-04-02 12:55:07.618322 7ff8a0ecd700 1 mds.WXS0023 6: '--setuser'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 7: 'ceph'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 8: '--setgroup'
> 2019-04-02 12:55:07.618325 7ff8a0ecd700 1 mds.WXS0023 9: 'ceph'
> 2019-04-02 12:55:07.618352 7ff8a0ecd700 1 mds.WXS0023 respawning with exe /usr/bin/ceph-mds
> 2019-04-02 12:55:07.618353 7ff8a0ecd700 1 mds.WXS0023 exe_path /proc/self/exe
> 2019-04-02 12:55:09.174064 7f4c596be200 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 3369045
> 2019-04-02 12:55:09.176292 7f4c596be200 0 pidfile_write: ignore empty --pid-file
> 2019-04-02 12:55:13.296469 7f4c531f1700 1 mds.WXS0023 handle_mds_map standby
>
>
> Thanks!

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com