Hi all! My CephFS data pool on a 15.2.12 cluster stopped working overnight. I have too much data on there that I planned to migrate today (not possible now, since I can't get CephFS back up). Something is very off, and I can't pinpoint what. The MDS keeps failing:

May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.0037s ago); MDS internal heartbeat is not healthy!
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 24.5037s ago); MDS internal heartbeat is not healthy!
May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 25.0037s ago); MDS internal heartbeat is not healthy!

[root@fh_ceph_b /]# free -h
              total        used        free      shared  buff/cache   available
Mem:          251Gi       168Gi        75Gi       4.0Gi       7.1Gi        70Gi
Swap:         4.0Gi          0B       4.0Gi

[root@fh_ceph_b /]# ceph -s
  cluster:
    id:     deadbeef-7d25-40ec-abc4-202104a6d54a
    health: HEALTH_WARN
            1 filesystem is degraded
            1 nearfull osd(s)
            13 pool(s) nearfull

  services:
    mon: 3 daemons, quorum fh_ceph_a,fh_ceph_b,fh_ceph_c (age 5M)
    mgr: fh_ceph_b(active, since 5M), standbys: fh_ceph_a, fh_ceph_c, fh_ceph_d
    mds: cephfs:2/2 {0=fh_ceph_c=up:resolve,1=fh_ceph_a=up:replay} 1 up:standby
    osd: 40 osds: 40 up (since 5M), 40 in (since 5M)
    rgw: 4 daemons active (fh_ceph_a.rgw0, fh_ceph_b.rgw0, fh_ceph_c.rgw0, fh_ceph_d.rgw0)

  task status:

  data:
    pools:   13 pools, 1929 pgs
    objects: 48.08M objects, 122 TiB
    usage:   423 TiB used, 215 TiB / 638 TiB avail
    pgs:     1922 active+clean
             7    active+clean+scrubbing+deep

  io:
    client:   6.2 MiB/s rd, 2 op/s rd, 0 op/s wr

After setting "ceph fs set cephfs max_mds 3" and waiting some time, the state of at least one rank changed to resolve (example):

[root@fh_ceph_a ~]# date ; podman exec ceph-mon-fh_ceph_a ceph fs status cephfs
Tue  3 May 12:14:12 CEST 2022
cephfs - 40 clients
======
RANK   STATE       MDS      ACTIVITY  DNS    INOS
 0    resolve   fh_ceph_c             27.0k  27.0k
 1    replay    fh_ceph_d                 0      0
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata  48.7G  17.5T
  cephfs_data      data     367T  17.5T
STANDBY MDS
 fh_ceph_b
 fh_ceph_a
MDS version: ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable)
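Given all the "Skipping beacon heartbeat" messages and the "Map removed me ... respawning!" in the log below, one thing I am tempted to try is raising the beacon grace so the monitors stop removing the MDS while it is still in replay. This is only a sketch of what I have in mind, assuming mds_beacon_grace (default 15 s) is really the right knob here and that 300 s is a sane value; please tell me if this is a bad idea:

# as far as I understand, the monitors use this grace to decide when to drop an MDS,
# and the MDS uses it for its internal heartbeat -- the value 300 is only my guess
ceph config set mon mds_beacon_grace 300
ceph config set mds mds_beacon_grace 300

# and revert once replay has finished
ceph config rm mon mds_beacon_grace
ceph config rm mds mds_beacon_grace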
Logs of the failing MDS (journalctl -f -u ceph-mds@$(hostname).service --since "5 minutes ago"):

May 03 11:59:37 fh_ceph_b conmon[12777]:   -20> 2022-05-03T11:59:36.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:06.069985+0200)
May 03 11:59:37 fh_ceph_b conmon[12777]:   -19> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -18> 2022-05-03T11:59:36.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.0078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -17> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -16> 2022-05-03T11:59:36.585+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 51.5078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -15> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: tick
May 03 11:59:37 fh_ceph_b conmon[12777]:   -14> 2022-05-03T11:59:37.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-05-03T11:59:07.070107+0200)
May 03 11:59:37 fh_ceph_b conmon[12777]:   -13> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -12> 2022-05-03T11:59:37.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping beacon heartbeat to monitors (last acked 52.0078s ago); MDS internal heartbeat is not healthy!
May 03 11:59:37 fh_ceph_b conmon[12777]:   -11> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
May 03 11:59:37 fh_ceph_b conmon[12777]:   -10> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.beacon.fh_ceph_b MDS connection to Monitors appears to be laggy; 52.4348s since last acked beacon
May 03 11:59:37 fh_ceph_b conmon[12777]:    -9> 2022-05-03T11:59:37.512+0200 7fffe53b1700  1 mds.1.73897 skipping upkeep work because connection to Monitors appears laggy
May 03 11:59:37 fh_ceph_b conmon[12777]:    -8> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mds.1.73897 handle_osd_map epoch 9541, 0 new blacklist entries
May 03 11:59:37 fh_ceph_b conmon[12777]:    -7> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _renew_subs
May 03 11:59:37 fh_ceph_b conmon[12777]:    -6> 2022-05-03T11:59:37.512+0200 7fffe73b5700 10 monclient: _send_mon_message to mon.fh_ceph_b at v2:10.251.23.112:3300/0
May 03 11:59:37 fh_ceph_b conmon[12777]:    -5> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc ms_handle_reset ms_handle_reset con 0x5555565eb800
May 03 11:59:37 fh_ceph_b conmon[12777]:    -4> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Terminating session with v2:10.251.23.112:6842/55
May 03 11:59:37 fh_ceph_b conmon[12777]:    -3> 2022-05-03T11:59:37.512+0200 7fffe73b5700  4 mgrc reconnect Starting new session with [v2:10.251.23.112:6842/55,v1:10.251.23.112:6843/55]
May 03 11:59:37 fh_ceph_b conmon[12777]:    -2> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73898 from mon.1
May 03 11:59:37 fh_ceph_b conmon[12777]:    -1> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b Map removed me [mds.fh_ceph_b{1:479935} state up:replay seq 1 addr [v2:10.251.23.112:6800/3613272291,v1:10.251.23.112:6801/3613272291]] from cluster; respawning! See cluster/monitor logs for details.
May 03 11:59:37 fh_ceph_b conmon[12777]:     0> 2022-05-03T11:59:37.512+0200 7fffe73b5700  1 mds.fh_ceph_b respawn!
May 03 11:59:37 fh_ceph_b conmon[12777]: --- logging levels ---
May 03 11:59:37 fh_ceph_b conmon[12777]:    0/ 5 none
....
May 03 11:59:37 fh_ceph_b conmon[12777]:   99/99 (stderr threshold)
May 03 11:59:37 fh_ceph_b conmon[12777]: --- pthread ID / name mapping for recent threads ---
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe03a7700 /
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe0ba8700 / MR_Finisher
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe1baa700 / PQ_Finisher
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe4bb0700 / ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe53b1700 / safe_timer
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe5bb2700 / fn_anonymous
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe63b3700 / safe_timer
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe73b5700 / ms_dispatch
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe83b7700 / admin_socket
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe8bb8700 / service
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe93b9700 / msgr-worker-2
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffe9bba700 / msgr-worker-1
May 03 11:59:37 fh_ceph_b conmon[12777]:   7fffea3bb700 / msgr-worker-0
May 03 11:59:37 fh_ceph_b conmon[12777]:   7ffff7fe0600 / ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]:   max_recent     10000
May 03 11:59:37 fh_ceph_b conmon[12777]:   max_new         1000
May 03 11:59:37 fh_ceph_b conmon[12777]:   log_file
May 03 11:59:37 fh_ceph_b conmon[12777]: --- end dump of recent events ---
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  e: '/usr/bin/ceph-mds'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  0: '/usr/bin/ceph-mds'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  1: '--cluster'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  2: 'freihaus'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  3: '--setuser'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  4: 'ceph'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  5: '--setgroup'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  6: 'ceph'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  7: '--default-log-to-stderr=true'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  8: '--err-to-stderr=true'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  9: '--default-log-to-file=false'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 10: '--foreground'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 11: '-i'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b 12: 'fh_ceph_b'
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b respawning with exe /usr/bin/ceph-mds
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.516+0200 7fffe73b5700  1 mds.fh_ceph_b  exe_path /proc/self/exe
May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setuser ceph since I am not root
May 03 11:59:37 fh_ceph_b conmon[12777]: ignoring --setgroup ceph since I am not root
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  0 ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable), process ceph-mds, pid 51
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.577+0200 7ffff7fe0600  1 main not setting numa affinity
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.578+0200 7ffff7fe0600  0 pidfile_write: ignore empty --pid-file
May 03 11:59:37 fh_ceph_b conmon[12777]: starting mds.fh_ceph_b at
May 03 11:59:37 fh_ceph_b conmon[12777]: 2022-05-03T11:59:37.581+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73900 from mon.1
May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73901 from mon.1
May 03 11:59:38 fh_ceph_b conmon[12777]: 2022-05-03T11:59:38.397+0200 7fffe73b5700  1 mds.fh_ceph_b Monitors have assigned me to become a standby.
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.185+0200 7fffe73b5700  1 mds.fh_ceph_b Updating MDS map to version 73902 from mon.1
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map i am now mds.1.73902
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 handle_mds_map state change up:boot --> up:replay
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 replay_start
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.187+0200 7fffe73b5700  1 mds.1.73902 waiting for osdmap 9543 (which blacklists prior instance)
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x101
May 03 12:00:04 fh_ceph_b conmon[12777]: 2022-05-03T12:00:04.288+0200 7fffe0ba8700  0 mds.1.cache creating system inode with ino:0x1

And about an hour and a half later, the same pattern on fh_ceph_a:

May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:42 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:42.664+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 6.50192s ago); MDS internal heartbeat is not healthy!
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.164+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.00191s ago); MDS internal heartbeat is not healthy!
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:43 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:43.663+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 7.50091s ago); MDS internal heartbeat is not healthy!
May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
May 03 13:20:44 fh_ceph_a conmon[3544615]: 2022-05-03T13:20:44.163+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to monitors (last acked 8.0009s ago); MDS internal heartbeat is not healthy!

(Sorry if I have now double-posted this; I think I have to be subscribed to post here, which I was not on my last try.)
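PS: In case it matters, this is roughly how I have been trying to tell whether the replaying MDS is actually making progress rather than being stuck. I am assuming the MDS container is named ceph-mds-<hostname> like the mon container above, and that rdpos under mds_log in the perf dump is the journal replay read position; please correct me if either assumption is wrong.

# current state/rank of the local MDS via its admin socket (container name is my guess)
podman exec ceph-mds-fh_ceph_b ceph daemon mds.fh_ceph_b status

# journal counters; I am watching whether rdpos keeps growing while the MDS is in up:replay
podman exec ceph-mds-fh_ceph_b ceph daemon mds.fh_ceph_b perf dump mds_log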