Okay, so you started out with 2 active MDSes and then they failed on a restart? And in an effort to fix it you changed max_mds to 3? (That was a bad idea, but *probably* didn't actually hurt anything this time; adding new work to scale out a system that already can't turn on just overloads it more!)

The logs here don't make it apparent what's going on. You should set "debug ms = 1" and "debug mds = 20" on your MDSes, restart them all, and then use ceph-post-file to upload the logs for analysis (a rough sketch of those commands is at the very bottom of this mail, below your quoted logs). What's visible here is very sparse, and if the MDS internal heartbeat is unhealthy there's something wrong in the depths that unfortunately isn't being output in what we can see.

-Greg

On Tue, May 3, 2022 at 1:25 PM Wagner-Kerschbaumer <wagner-kerschbaumer@xxxxxxxxxxxxx> wrote:
>
> Hi All!
> My CephFS data pool on a 15.2.12 cluster stopped working overnight.
> I have too much data on there, which I planned to migrate today. (Not possible now, since I can't get CephFS back up.)
>
> Something is very off, and I can't pinpoint what. The MDS keeps failing:
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.0037s ago); MDS internal heartbeat is not healthy!
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.5037s ago); MDS internal heartbeat is not healthy!
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 25.0037s ago); MDS internal heartbeat is not healthy!
> [root@fh_ceph_b /]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:          251Gi       168Gi        75Gi       4.0Gi       7.1Gi        70Gi
> Swap:         4.0Gi          0B       4.0Gi
> [root@fh_ceph_b /]# ceph -s
>   cluster:
>     id:     deadbeef-7d25-40ec-abc4-202104a6d54a
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 nearfull osd(s)
>             13 pool(s) nearfull
>
>   services:
>     mon: 3 daemons, quorum fh_ceph_a,fh_ceph_b,fh_ceph_c (age 5M)
>     mgr: fh_ceph_b(active, since 5M), standbys: fh_ceph_a, fh_ceph_c, fh_ceph_d
>     mds: cephfs:2/2 {0=fh_ceph_c=up:resolve,1=fh_ceph_a=up:replay} 1 up:standby
>     osd: 40 osds: 40 up (since 5M), 40 in (since 5M)
>     rgw: 4 daemons active (fh_ceph_a.rgw0, fh_ceph_b.rgw0, fh_ceph_c.rgw0, fh_ceph_d.rgw0)
>
>   task status:
>
>   data:
>     pools:   13 pools, 1929 pgs
>     objects: 48.08M objects, 122 TiB
>     usage:   423 TiB used, 215 TiB / 638 TiB avail
>     pgs:     1922 active+clean
>              7    active+clean+scrubbing+deep
>
>   io:
>     client: 6.2 MiB/s rd, 2 op/s rd, 0 op/s wr
>
> After setting "ceph fs set cephfs max_mds 3" and waiting some time, the state of at least one MDS changed to resolve.
>
> (example)
> [root@fh_ceph_a ~]# date ; podman exec ceph-mon-fh_ceph_a ceph fs status cephfs
> Tue  3 May 12:14:12 CEST 2022
> cephfs - 40 clients
> ======
> RANK   STATE      MDS      ACTIVITY   DNS    INOS
>  0    resolve  fh_ceph_c              27.0k  27.0k
>  1    replay   fh_ceph_d                  0      0
>       POOL         TYPE     USED  AVAIL
> cephfs_metadata  metadata  48.7G  17.5T
>   cephfs_data      data     367T  17.5T
> STANDBY MDS
>  fh_ceph_b
>  fh_ceph_a
> MDS version: ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable)
>
> Logs of the failing MDS (journalctl -f -u ceph-mds@$(hostname).service --since "5 minutes ago"):
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -20> 2022-05-03T11:59:36.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:06.069985+0200)
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -19> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -18> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.0078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -17> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -16> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.5078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -15> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: tick
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -14> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:07.070107+0200)
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -13> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -12> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 52.0078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -11> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -10> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.beacon.fh_ceph_b MDS connection to Monitors appears to be laggy; 52.4348s since last acked beacon
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -9> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.1.73897 skipping upkeep work because connection to Monitors appears laggy
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -8> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mds.1.73897 handle_osd_map epoch 9541, 0 new blacklist entries
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -7> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _renew_subs
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -6> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _send_mon_message to mon.fh_ceph_b at v2:10.251.23.112:3300/0
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -5> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc ms_handle_reset ms_handle_reset con 0x5555565eb800
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -4> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Terminating session with v2:10.251.23.112:6842/55
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -3> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Starting new session with [v2:10.251.23.112:6842/55,v1:10.251.23.112:6843/55]
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -2> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73898 from mon.1
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -1> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Map removed me [mds.fh_ceph_b{1:479935} state up:replay seq 1 addr [v2:10.251.23.112:6800/3613272291,v1:10.251.23.112:6801/3613272291]] from cluster; respawning! See cluster/monitor logs for details.
> May 03 11:59:37 fh_ceph_b conmon[12777]:      0> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b respawn!
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- logging levels ---
> May 03 11:59:37 fh_ceph_b conmon[12777]:    0/ 5 none
> ....
>
> May 03 11:59:37 fh_ceph_b conmon[12777]:   99/99 (stderr threshold)
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- pthread ID / name mapping for recent threads ---
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe03a7700 /
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe0ba8700 / MR_Finisher
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe1baa700 / PQ_Finisher
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe4bb0700 / ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe53b1700 / safe_timer
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe5bb2700 / fn_anonymous
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe63b3700 / safe_timer
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe73b5700 / ms_dispatch
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe83b7700 / admin_socket
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe8bb8700 / service
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe93b9700 / msgr-worker-2
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe9bba700 / msgr-worker-1
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffea3bb700 / msgr-worker-0
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7ffff7fe0600 / ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]:   max_recent     10000
> May 03 11:59:37 fh_ceph_b conmon[12777]:   max_new         1000
> May 03 11:59:37 fh_ceph_b conmon[12777]:   log_file
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- end dump of recent events ---
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  e: '/usr/bin/ceph-mds'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  0: '/usr/bin/ceph-mds'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  1: '--cluster'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  2: 'freihaus'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  3: '--setuser'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  4: 'ceph'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  5: '--setgroup'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  6: 'ceph'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  7: '--default-log-to-stderr=true'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  8: '--err-to-stderr=true'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  9: '--default-log-to-file=false'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 10: '--foreground'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 11: '-i'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 12: 'fh_ceph_b'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b respawning with exe /usr/bin/ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  exe_path /proc/self/exe
> May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setuser ceph since I am not root
> May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setgroup ceph since I am not root
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  0 ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable), process ceph-mds, pid 51
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  1 main not setting numa affinity
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.578+0200 7ffff7fe0600  0 pidfile_write: ignore empty --pid-file
> May 03 11:59:37 fh_ceph_b conmon[12777]: starting mds.fh_ceph_b at
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.581+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73900 from mon.1
> May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73901 from mon.1
> May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Monitors have assigned me to become a standby.
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.185+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73902 from mon.1
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map i am now mds.1.73902
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map state change up:boot --> up:replay
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 replay_start
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902  waiting for osdmap 9543 (which blacklists prior instance)
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x101
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x1
>
> May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 6.50192s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.00191s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.50091s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 8.0009s ago); MDS internal heartbeat is not healthy!
>
> (Sorry if I have now double-posted this; I think I have to be subscribed to post here, which I was not on my last try.)
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
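P.S. Here is roughly what I mean by raising the debug levels and uploading the result. Treat this as a sketch only: the commands are standard Ceph CLI, but the journal-based log capture and the /tmp path are my assumptions for your containerized setup (your MDSes run with --default-log-to-file=false, so the useful output is probably only in the journal).

  # raise MDS debugging cluster-wide; this persists across the restart
  ceph config set mds debug_ms 1
  ceph config set mds debug_mds 20

  # on each MDS host, restart the daemon
  systemctl restart ceph-mds@$(hostname).service

  # once an MDS has gone unhealthy again, capture its output and upload it
  journalctl -u ceph-mds@$(hostname).service --since "1 hour ago" > /tmp/mds-$(hostname).log
  ceph-post-file /tmp/mds-$(hostname).log

ceph-post-file should print an ID you can reference in your reply so we can find the upload.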