Please set debug_mds=10 and try again.
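For example, something like this should bump the log level on the standby (the mds.WXS0023 name below is just the one from your log, adjust it to your daemon):

    ceph daemon mds.WXS0023 config set debug_mds 10

or, if you'd rather inject it from a node with an admin keyring:

    ceph tell mds.WXS0023 injectargs '--debug_mds 10'

Putting "debug mds = 10" in the [mds] section of ceph.conf before restarting the daemon works as well. Then capture the log from the next replay attempt and send it along.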
On Tue, Apr 2, 2019 at 1:01 PM Albert Yue <transuranium.yue@xxxxxxxxx> wrote:
>
> Hi,
>
> This happens after we restart the active MDS, and somehow the standby MDS daemon cannot take over successfully and is stuck at up:replaying. It is showing the following log. Any idea on how to fix this?
>
> 2019-04-02 12:54:00.985079 7f6f70670700 1 mds.WXS0023 respawn
> 2019-04-02 12:54:00.985095 7f6f70670700 1 mds.WXS0023 e: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985097 7f6f70670700 1 mds.WXS0023 0: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985099 7f6f70670700 1 mds.WXS0023 1: '-f'
> 2019-04-02 12:54:00.985100 7f6f70670700 1 mds.WXS0023 2: '--cluster'
> 2019-04-02 12:54:00.985101 7f6f70670700 1 mds.WXS0023 3: 'ceph'
> 2019-04-02 12:54:00.985102 7f6f70670700 1 mds.WXS0023 4: '--id'
> 2019-04-02 12:54:00.985103 7f6f70670700 1 mds.WXS0023 5: 'WXS0023'
> 2019-04-02 12:54:00.985104 7f6f70670700 1 mds.WXS0023 6: '--setuser'
> 2019-04-02 12:54:00.985105 7f6f70670700 1 mds.WXS0023 7: 'ceph'
> 2019-04-02 12:54:00.985106 7f6f70670700 1 mds.WXS0023 8: '--setgroup'
> 2019-04-02 12:54:00.985107 7f6f70670700 1 mds.WXS0023 9: 'ceph'
> 2019-04-02 12:54:00.985142 7f6f70670700 1 mds.WXS0023 respawning with exe /usr/bin/ceph-mds
> 2019-04-02 12:54:00.985145 7f6f70670700 1 mds.WXS0023 exe_path /proc/self/exe
> 2019-04-02 12:54:02.139272 7ff8a739a200 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 3369045
> 2019-04-02 12:54:02.141565 7ff8a739a200 0 pidfile_write: ignore empty --pid-file
> 2019-04-02 12:54:06.675604 7ff8a0ecd700 1 mds.WXS0023 handle_mds_map standby
> 2019-04-02 12:54:26.114757 7ff8a0ecd700 1 mds.0.136021 handle_mds_map i am now mds.0.136021
> 2019-04-02 12:54:26.114764 7ff8a0ecd700 1 mds.0.136021 handle_mds_map state change up:boot --> up:replay
> 2019-04-02 12:54:26.114779 7ff8a0ecd700 1 mds.0.136021 replay_start
> 2019-04-02 12:54:26.114784 7ff8a0ecd700 1 mds.0.136021 recovery set is
> 2019-04-02 12:54:26.114789 7ff8a0ecd700 1 mds.0.136021 waiting for osdmap 14333 (which blacklists prior instance)
> 2019-04-02 12:54:26.141256 7ff89a6c0700 0 mds.0.cache creating system inode with ino:0x100
> 2019-04-02 12:54:26.141454 7ff89a6c0700 0 mds.0.cache creating system inode with ino:0x1
> 2019-04-02 12:54:50.148022 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:50.148049 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:54:52.143637 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:54.148122 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:54.148157 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:54:57.143730 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:58.148239 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:54:58.148249 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:02.143819 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:02.148311 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:02.148330 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:06.148393 7ff89dec7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:06.148416 7ff89dec7700 1 mds.beacon.WXS0023 _send skipping beacon, heartbeat map not healthy
> 2019-04-02 12:55:07.143914 7ff8a1ecf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> 2019-04-02 12:55:07.615602 7ff89e6c8700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
> 2019-04-02 12:55:07.618294 7ff8a0ecd700 1 mds.WXS0023 map removed me (mds.-1 gid:7441294) from cluster due to lost contact; respawning
> 2019-04-02 12:55:07.618296 7ff8a0ecd700 1 mds.WXS0023 respawn
> 2019-04-02 12:55:07.618314 7ff8a0ecd700 1 mds.WXS0023 e: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618318 7ff8a0ecd700 1 mds.WXS0023 0: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618319 7ff8a0ecd700 1 mds.WXS0023 1: '-f'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 2: '--cluster'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700 1 mds.WXS0023 3: 'ceph'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 4: '--id'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700 1 mds.WXS0023 5: 'WXS0023'
> 2019-04-02 12:55:07.618322 7ff8a0ecd700 1 mds.WXS0023 6: '--setuser'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 7: 'ceph'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700 1 mds.WXS0023 8: '--setgroup'
> 2019-04-02 12:55:07.618325 7ff8a0ecd700 1 mds.WXS0023 9: 'ceph'
> 2019-04-02 12:55:07.618352 7ff8a0ecd700 1 mds.WXS0023 respawning with exe /usr/bin/ceph-mds
> 2019-04-02 12:55:07.618353 7ff8a0ecd700 1 mds.WXS0023 exe_path /proc/self/exe
> 2019-04-02 12:55:09.174064 7f4c596be200 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 3369045
> 2019-04-02 12:55:09.176292 7f4c596be200 0 pidfile_write: ignore empty --pid-file
> 2019-04-02 12:55:13.296469 7f4c531f1700 1 mds.WXS0023 handle_mds_map standby
>
>
> Thanks!

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com