Okay, so you started out with 2 active MDSes and then they failed on a restart? And in an effort to fix it you changed max_mds to 3? (That was a bad idea, but *probably* didn't actually hurt anything this time; adding new work to scale out a system that already can't turn on just overloads it more!)

The logs here don't make it apparent what's going on. You should set "debug ms = 1" and "debug mds = 20" on your MDSes, restart them all, and then use ceph-post-file to upload the logs for analysis (a rough sketch of those commands is at the very bottom of this mail, below your quoted logs). What's visible here is very sparse, and if the MDS internal heartbeat is unhealthy there's something wrong in the depths that unfortunately isn't being output in what we can see.

-Greg

On Tue, May 3, 2022 at 1:25 PM Wagner-Kerschbaumer <wagner-kerschbaumer@xxxxxxxxxxxxx> wrote:
>
> Hi All!
> My CephFS data pool on a 15.2.12 cluster stopped working overnight.
> I have too much data on there, which I planned to migrate today. (Not possible now, since I can't get CephFS back up.)
>
> Something is very off, and I can't pinpoint what. The MDS keeps failing:
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.0037s ago); MDS internal heartbeat is not healthy!
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.5037s ago); MDS internal heartbeat is not healthy!
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 25.0037s ago); MDS internal heartbeat is not healthy!
> [root@fh_ceph_b /]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:          251Gi       168Gi        75Gi       4.0Gi       7.1Gi        70Gi
> Swap:         4.0Gi          0B       4.0Gi
> [root@fh_ceph_b /]# ceph -s
>   cluster:
>     id:     deadbeef-7d25-40ec-abc4-202104a6d54a
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 nearfull osd(s)
>             13 pool(s) nearfull
>
>   services:
>     mon: 3 daemons, quorum fh_ceph_a,fh_ceph_b,fh_ceph_c (age 5M)
>     mgr: fh_ceph_b(active, since 5M), standbys: fh_ceph_a, fh_ceph_c, fh_ceph_d
>     mds: cephfs:2/2 {0=fh_ceph_c=up:resolve,1=fh_ceph_a=up:replay} 1 up:standby
>     osd: 40 osds: 40 up (since 5M), 40 in (since 5M)
>     rgw: 4 daemons active (fh_ceph_a.rgw0, fh_ceph_b.rgw0, fh_ceph_c.rgw0, fh_ceph_d.rgw0)
>
>   task status:
>
>   data:
>     pools:   13 pools, 1929 pgs
>     objects: 48.08M objects, 122 TiB
>     usage:   423 TiB used, 215 TiB / 638 TiB avail
>     pgs:     1922 active+clean
>              7    active+clean+scrubbing+deep
>
>   io:
>     client: 6.2 MiB/s rd, 2 op/s rd, 0 op/s wr
>
> After setting "ceph fs set cephfs max_mds 3" and waiting some time, the state of at least one MDS changed to resolve.
>
> (example)
> [root@fh_ceph_a ~]# date ; podman exec ceph-mon-fh_ceph_a ceph fs status cephfs
> Tue  3 May 12:14:12 CEST 2022
> cephfs - 40 clients
> ======
> RANK   STATE      MDS      ACTIVITY   DNS    INOS
>  0    resolve  fh_ceph_c              27.0k  27.0k
>  1    replay   fh_ceph_d                  0      0
>       POOL         TYPE     USED  AVAIL
> cephfs_metadata  metadata  48.7G  17.5T
>   cephfs_data      data     367T  17.5T
> STANDBY MDS
>  fh_ceph_b
>  fh_ceph_a
> MDS version: ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable)
>
> Logs of the failing MDS (journalctl -f -u ceph-mds@$(hostname).service --since "5 minutes ago"):
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -20> 2022-05-03T11:59:36.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:06.069985+0200)
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -19> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -18> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.0078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -17> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -16> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.5078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -15> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: tick
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -14> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:07.070107+0200)
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -13> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -12> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 52.0078s ago); MDS internal heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -11> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:    -10> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.beacon.fh_ceph_b MDS connection to Monitors appears to be laggy; 52.4348s since last acked beacon
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -9> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.1.73897 skipping upkeep work because connection to Monitors appears laggy
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -8> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mds.1.73897 handle_osd_map epoch 9541, 0 new blacklist entries
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -7> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _renew_subs
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -6> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _send_mon_message to mon.fh_ceph_b at v2:10.251.23.112:3300/0
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -5> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc ms_handle_reset ms_handle_reset con 0x5555565eb800
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -4> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Terminating session with v2:10.251.23.112:6842/55
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -3> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Starting new session with [v2:10.251.23.112:6842/55,v1:10.251.23.112:6843/55]
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -2> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73898 from mon.1
> May 03 11:59:37 fh_ceph_b conmon[12777]:     -1> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Map removed me [mds.fh_ceph_b{1:479935} state up:replay seq 1 addr [v2:10.251.23.112:6800/3613272291,v1:10.251.23.112:6801/3613272291]] from cluster; respawning! See cluster/monitor logs for details.
> May 03 11:59:37 fh_ceph_b conmon[12777]:      0> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b respawn!
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- logging levels ---
> May 03 11:59:37 fh_ceph_b conmon[12777]:    0/ 5 none
> ....
>
> May 03 11:59:37 fh_ceph_b conmon[12777]:   99/99 (stderr threshold)
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- pthread ID / name mapping for recent threads ---
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe03a7700 /
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe0ba8700 / MR_Finisher
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe1baa700 / PQ_Finisher
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe4bb0700 / ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe53b1700 / safe_timer
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe5bb2700 / fn_anonymous
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe63b3700 / safe_timer
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe73b5700 / ms_dispatch
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe83b7700 / admin_socket
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe8bb8700 / service
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe93b9700 / msgr-worker-2
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe9bba700 / msgr-worker-1
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffea3bb700 / msgr-worker-0
> May 03 11:59:37 fh_ceph_b conmon[12777]:   7ffff7fe0600 / ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]:   max_recent     10000
> May 03 11:59:37 fh_ceph_b conmon[12777]:   max_new         1000
> May 03 11:59:37 fh_ceph_b conmon[12777]:   log_file
> May 03 11:59:37 fh_ceph_b conmon[12777]: --- end dump of recent events ---
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  e: '/usr/bin/ceph-mds'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  0: '/usr/bin/ceph-mds'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  1: '--cluster'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  2: 'freihaus'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  3: '--setuser'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  4: 'ceph'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  5: '--setgroup'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  6: 'ceph'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  7: '--default-log-to-stderr=true'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  8: '--err-to-stderr=true'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  9: '--default-log-to-file=false'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 10: '--foreground'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 11: '-i'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 12: 'fh_ceph_b'
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b respawning with exe /usr/bin/ceph-mds
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  exe_path /proc/self/exe
> May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setuser ceph since I am not root
> May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setgroup ceph since I am not root
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  0 ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable), process ceph-mds, pid 51
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  1 main not setting numa affinity
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.578+0200 7ffff7fe0600  0 pidfile_write: ignore empty --pid-file
> May 03 11:59:37 fh_ceph_b conmon[12777]: starting mds.fh_ceph_b at
> May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.581+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73900 from mon.1
> May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73901 from mon.1
> May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Monitors have assigned me to become a standby.
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.185+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73902 from mon.1
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map i am now mds.1.73902
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map state change up:boot --> up:replay
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 replay_start
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902  waiting for osdmap 9543 (which blacklists prior instance)
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x101
> May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x1
>
> May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 6.50192s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.00191s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.50091s ago); MDS internal heartbeat is not healthy!
> May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
> May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 8.0009s ago); MDS internal heartbeat is not healthy!
>
> (Sorry if I have now double-posted this; I think I have to be subscribed to post here, which I was not on my last try.)
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
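P.S. Here is roughly what I mean by raising the debug levels and uploading the result. Treat this as a sketch only: the commands are standard Ceph CLI, but the journal-based log capture and the /tmp path are my assumptions for your containerized setup (your MDSes run with --default-log-to-file=false, so the useful output is probably only in the journal).

  # raise MDS debugging cluster-wide; this persists across the restart
  ceph config set mds debug_ms 1
  ceph config set mds debug_mds 20

  # on each MDS host, restart the daemon
  systemctl restart ceph-mds@$(hostname).service

  # once an MDS has gone unhealthy again, capture its output and upload it
  journalctl -u ceph-mds@$(hostname).service --since "1 hour ago" > /tmp/mds-$(hostname).log
  ceph-post-file /tmp/mds-$(hostname).log

ceph-post-file should print an ID you can reference in your reply so we can find the upload.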