Hi all! My CephFS data pool on a 15.2.12 cluster stopped working overnight. I have too much data on there that I planned to migrate today (not possible now, since I can't get CephFS back up). Something is very off, and I can't pinpoint what. The MDS keeps failing:

May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.0037s ago); MDS internal heartbeat is not healthy!
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.5037s ago); MDS internal heartbeat is not healthy!
May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 25.0037s ago); MDS internal heartbeat is not healthy!

[root@fh_ceph_b /]# free -h
              total        used        free      shared  buff/cache   available
Mem:          251Gi       168Gi        75Gi       4.0Gi       7.1Gi        70Gi
Swap:         4.0Gi          0B       4.0Gi

[root@fh_ceph_b /]# ceph -s
  cluster:
    id:     deadbeef-7d25-40ec-abc4-202104a6d54a
    health: HEALTH_WARN
            1 filesystem is degraded
            1 nearfull osd(s)
            13 pool(s) nearfull

  services:
    mon: 3 daemons, quorum fh_ceph_a,fh_ceph_b,fh_ceph_c (age 5M)
    mgr: fh_ceph_b(active, since 5M), standbys: fh_ceph_a, fh_ceph_c, fh_ceph_d
    mds: cephfs:2/2 {0=fh_ceph_c=up:resolve,1=fh_ceph_a=up:replay} 1 up:standby
    osd: 40 osds: 40 up (since 5M), 40 in (since 5M)
    rgw: 4 daemons active (fh_ceph_a.rgw0, fh_ceph_b.rgw0, fh_ceph_c.rgw0, fh_ceph_d.rgw0)

  task status:

  data:
    pools:   13 pools, 1929 pgs
    objects: 48.08M objects, 122 TiB
    usage:   423 TiB used, 215 TiB / 638 TiB avail
    pgs:     1922 active+clean
             7    active+clean+scrubbing+deep

  io:
    client:   6.2 MiB/s rd, 2 op/s rd, 0 op/s wr

After setting "ceph fs set cephfs max_mds 3" and waiting some time, the state of at least one rank changed to resolve (example):

[root@fh_ceph_a ~]# date ; podman exec ceph-mon-fh_ceph_a ceph fs status cephfs
Tue  3 May 12:14:12 CEST 2022
cephfs - 40 clients
======
RANK   STATE       MDS      ACTIVITY  DNS    INOS
 0    resolve   fh_ceph_c             27.0k  27.0k
 1    replay    fh_ceph_d                 0      0
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata  48.7G  17.5T
  cephfs_data      data     367T  17.5T
STANDBY MDS
 fh_ceph_b
 fh_ceph_a
MDS version: ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable)
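Given all the "Skipping beacon heartbeat" messages and the "Map removed me ... respawning!" in the log below, one thing I am tempted to try is raising the beacon grace so the monitors stop removing the MDS while it is still in replay. This is only a sketch of what I have in mind, assuming mds_beacon_grace (default 15 s) is really the right knob here and that 300 s is a sane value; please tell me if this is a bad idea:

# as far as I understand, the monitors use this grace to decide when to drop an MDS,
# and the MDS uses it for its internal heartbeat -- the value 300 is only my guess
ceph config set mon mds_beacon_grace 300
ceph config set mds mds_beacon_grace 300

# and revert once replay has finished
ceph config rm mon mds_beacon_grace
ceph config rm mds mds_beacon_grace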
Logs of the failing MDS (journalctl -f -u ceph-mds@$(hostname).service --since "5 minutes ago"):

May 03 11:59:37 fh_ceph_b conmon[12777]:   -20> 2022-05-03T11:59:36.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:06.069985+0200)
May 03 11:59:37 fh_ceph_b conmon[12777]:   -19> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -18> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.0078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -17> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -16> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.5078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -15> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: tick
May 03 11:59:37 fh_ceph_b conmon[12777]:   -14> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:07.070107+0200)
May 03 11:59:37 fh_ceph_b conmon[12777]:   -13> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -12> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 52.0078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -11> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -10> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.beacon.fh_ceph_b MDS connection to Monitors appears to be laggy; 52.4348s since last acked beacon
May 03 11:59:37 fh_ceph_b conmon[12777]:    -9> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.1.73897 skipping upkeep work because connection to Monitors appears laggy
May 03 11:59:37 fh_ceph_b conmon[12777]:    -8> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mds.1.73897 handle_osd_map epoch 9541, 0 new blacklist entries
May 03 11:59:37 fh_ceph_b conmon[12777]:    -7> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _renew_subs
May 03 11:59:37 fh_ceph_b conmon[12777]:    -6> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _send_mon_message to mon.fh_ceph_b at v2:10.251.23.112:3300/0
May 03 11:59:37 fh_ceph_b conmon[12777]:    -5> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc ms_handle_reset ms_handle_reset con 0x5555565eb800
May 03 11:59:37 fh_ceph_b conmon[12777]:    -4> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Terminating session with v2:10.251.23.112:6842/55
May 03 11:59:37 fh_ceph_b conmon[12777]:    -3> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Starting new session with [v2:10.251.23.112:6842/55,v1:10.251.23.112:6843/55]
May 03 11:59:37 fh_ceph_b conmon[12777]:    -2> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73898 from mon.1
May 03 11:59:37 fh_ceph_b conmon[12777]:    -1> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Map removed me [mds.fh_ceph_b{1:479935} state up:replay seq 1 addr [v2:10.251.23.112:6800/3613272291,v1:10.251.23.112:6801/3613272291]] from cluster; respawning! See cluster/monitor logs for details.
May 03 11:59:37 fh_ceph_b conmon[12777]:     0> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b respawn!
May 03 11:59:37 fh_ceph_b conmon[12777]: --- logging levels ---
May 03 11:59:37 fh_ceph_b conmon[12777]:    0/ 5 none
....
May 03 11:59:37 fh_ceph_b conmon[12777]:   99/99 (stderr threshold)
May 03 11:59:37 fh_ceph_b conmon[12777]: --- pthread ID / name mapping for recent threads ---
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe03a7700 /
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe0ba8700 / MR_Finisher
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe1baa700 / PQ_Finisher
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe4bb0700 / ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe53b1700 / safe_timer
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe5bb2700 / fn_anonymous
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe63b3700 / safe_timer
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe73b5700 / ms_dispatch
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe83b7700 / admin_socket
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe8bb8700 / service
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe93b9700 / msgr-worker-2
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe9bba700 / msgr-worker-1
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffea3bb700 / msgr-worker-0
May 03 11:59:37 fh_ceph_b conmon[12777]:   7ffff7fe0600 / ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]:   max_recent     10000
May 03 11:59:37 fh_ceph_b conmon[12777]:   max_new         1000
May 03 11:59:37 fh_ceph_b conmon[12777]:   log_file
May 03 11:59:37 fh_ceph_b conmon[12777]: --- end dump of recent events ---
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  e: '/usr/bin/ceph-mds'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  0: '/usr/bin/ceph-mds'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  1: '--cluster'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  2: 'freihaus'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  3: '--setuser'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  4: 'ceph'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  5: '--setgroup'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  6: 'ceph'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  7: '--default-log-to-stderr=true'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  8: '--err-to-stderr=true'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  9: '--default-log-to-file=false'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 10: '--foreground'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 11: '-i'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 12: 'fh_ceph_b'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b respawning with exe /usr/bin/ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  exe_path /proc/self/exe
May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setuser ceph since I am not root
May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setgroup ceph since I am not root
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  0 ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable), process ceph-mds, pid 51
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  1 main not setting numa affinity
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.578+0200 7ffff7fe0600  0 pidfile_write: ignore empty --pid-file
May 03 11:59:37 fh_ceph_b conmon[12777]: starting mds.fh_ceph_b at
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.581+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73900 from mon.1
May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73901 from mon.1
May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Monitors have assigned me to become a standby.
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.185+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73902 from mon.1
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map i am now mds.1.73902
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map state change up:boot --> up:replay
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 replay_start
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 waiting for osdmap 9543 (which blacklists prior instance)
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x101
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x1

And about an hour and a half later, the same pattern on fh_ceph_a:

May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 6.50192s ago); MDS internal heartbeat is not healthy!
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.00191s ago); MDS internal heartbeat is not healthy!
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.50091s ago); MDS internal heartbeat is not healthy!
May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 8.0009s ago); MDS internal heartbeat is not healthy!

(Sorry if I have now double-posted this; I think I have to be subscribed to post here, which I was not on my last try.)
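PS: In case it matters, this is roughly how I have been trying to tell whether the replaying MDS is actually making progress rather than being stuck. I am assuming the MDS container is named ceph-mds-<hostname> like the mon container above, and that rdpos under mds_log in the perf dump is the journal replay read position; please correct me if either assumption is wrong.

# current state/rank of the local MDS via its admin socket (container name is my guess)
podman exec ceph-mds-fh_ceph_b ceph daemon mds.fh_ceph_b status

# journal counters; I am watching whether rdpos keeps growing while the MDS is in up:replay
podman exec ceph-mds-fh_ceph_b ceph daemon mds.fh_ceph_b perf dump mds_log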