Hi All,

My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity overnight. The metadata is on a separate pool which didn't hit capacity, but the filesystem stopped working, which I'd expect. I increased the osd full-ratio to give me some breathing room to get some data deleted once the filesystem is back online.

When I attempt to restart the MDS service, I see the usual stuff I'd expect in the log, but then:

    heartbeat_map is_healthy 'MDSRank' had timed out after 15

Followed by:

    mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS internal heartbeat is not healthy!

Eventually I get:

    mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
    mds.0.90884 skipping upkeep work because connection to Monitors appears laggy
    mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
    mds.beacon.hostnamecephssd01 MDS is no longer laggy

The "MDS is no longer laggy" message appears to be where the service fails.

Meanwhile, ceph -s is showing:

    cluster:
      id:     5c5998fd-dc9b-47ec-825e-beaba66aad11
      health: HEALTH_ERR
              1 filesystem is degraded
              insufficient standby MDS daemons available
              67 backfillfull osd(s)
              11 nearfull osd(s)
              full ratio(s) out of order
              2 pool(s) backfillfull
              2 pool(s) nearfull
              6 scrub errors
              Possible data damage: 5 pgs inconsistent

    services:
      mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
      mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
      mds: cephfs-1/1/1 up {0=hostnamecephssd01=up:replay}
      osd: 172 osds: 161 up, 161 in

    data:
      pools:   5 pools, 8384 pgs
      objects: 76.25M objects, 124TiB
      usage:   373TiB used, 125TiB / 498TiB avail
      pgs:     8379 active+clean
               5    active+clean+inconsistent

    io:
      client: 676KiB/s rd, 0op/s rd, 0op/s w

The 5 inconsistent pgs are not a new issue; they are from past scrubs and I just haven't gotten around to manually clearing them, although I suppose they could be related to my issue.

The cluster has
no clients connected.

I did notice in ceph.log that some OSDs on the same host as the MDS service briefly went down when I tried to restart the MDS, but the logs of those particular OSDs aren't showing any glaring issues.

Full MDS log at debug 5 (can go higher if needed):

2020-10-22 11:27:10.987652 7f6f696f5240 0 set uid:gid to 167:167 (ceph:ceph)
2020-10-22 11:27:10.987669 7f6f696f5240 0 ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable), process ceph-mds, pid 2022582
2020-10-22 11:27:10.990567 7f6f696f5240 0 pidfile_write: ignore empty --pid-file
2020-10-22 11:27:11.027981 7f6f62616700 1 mds.hostnamecephssd01 Updating MDS map to version 90882 from mon.0
2020-10-22 11:27:15.097957 7f6f62616700 1 mds.hostnamecephssd01 Updating MDS map to version 90883 from mon.0
2020-10-22 11:27:15.097989 7f6f62616700 1 mds.hostnamecephssd01 Map has assigned me to become a standby
2020-10-22 11:27:15.101071 7f6f62616700 1 mds.hostnamecephssd01 Updating MDS map to version 90884 from mon.0
2020-10-22 11:27:15.105310 7f6f62616700 1 mds.0.90884 handle_mds_map i am now mds.0.90884
2020-10-22 11:27:15.105316 7f6f62616700 1 mds.0.90884 handle_mds_map state change up:boot --> up:replay
2020-10-22 11:27:15.105325 7f6f62616700 1 mds.0.90884 replay_start
2020-10-22 11:27:15.105333 7f6f62616700 1 mds.0.90884 recovery set is
2020-10-22 11:27:15.105344 7f6f62616700 1 mds.0.90884 waiting for osdmap 73745 (which blacklists prior instance)
2020-10-22 11:27:15.149092 7f6f5be09700 0 mds.0.cache creating system inode with ino:0x100
2020-10-22 11:27:15.149693 7f6f5be09700 0 mds.0.cache creating system inode with ino:0x1
2020-10-22 11:27:41.021708 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:27:43.029290 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:27:43.029297 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:27:45.866711 7f6f5fe11700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-10-22 11:28:01.021965 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:03.029862 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:03.029885 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 4.00113s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:06.022033 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:07.029955 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:07.029961 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 8.00126s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:11.022099 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:11.030024 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:11.030028 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 12.0014s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:15.030092 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:15.030099 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 16.0015s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:16.022165 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:19.030163 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:19.030169 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 20.0016s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:21.022231 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:23.030233 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:23.030241 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 24.0008s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:26.022295 7f6f63618700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:27.030305 7f6f5f610700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:28:27.030311 7f6f5f610700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 28.0009s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:28:28.401161 7f6f5fe11700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-10-22 11:28:28.401168 7f6f5fe11700 1 mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
2020-10-22 11:28:28.401177 7f6f5fe11700 1 mds.0.90884 skipping upkeep work because connection to Monitors appears laggy
2020-10-22 11:28:28.401187 7f6f62616700 1 mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
2020-10-22 11:28:31.659817 7f6f64595700 0 mds.beacon.hostnamecephssd01 MDS is no longer laggy

2020-10-22 11:36:15.880009 7f88ee4ac240 0 set uid:gid to 167:167 (ceph:ceph)
2020-10-22 11:36:15.880026 7f88ee4ac240 0 ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable), process ceph-mds, pid 2022663
2020-10-22 11:36:15.883118 7f88ee4ac240 0 pidfile_write: ignore empty --pid-file
2020-10-22 11:36:15.921200 7f88e73cd700 1 mds.hostnamecephssd01 Updating MDS map to version 90887 from mon.2
2020-10-22 11:36:20.270298 7f88e73cd700 1 mds.hostnamecephssd01 Updating MDS map to version 90888 from mon.2
2020-10-22 11:36:20.270329 7f88e73cd700 1 mds.hostnamecephssd01 Map has assigned me to become a standby
2020-10-22 11:36:20.272917 7f88e73cd700 1 mds.hostnamecephssd01 Updating MDS map to version 90889 from mon.2
2020-10-22 11:36:20.277063 7f88e73cd700 1 mds.0.90889 handle_mds_map i am now mds.0.90889
2020-10-22 11:36:20.277069 7f88e73cd700 1 mds.0.90889 handle_mds_map state change up:boot --> up:replay
2020-10-22 11:36:20.277079 7f88e73cd700 1 mds.0.90889 replay_start
2020-10-22 11:36:20.277086 7f88e73cd700 1 mds.0.90889 recovery set is
2020-10-22 11:36:20.277096 7f88e73cd700 1 mds.0.90889 waiting for osdmap 73746 (which blacklists prior instance)
2020-10-22 11:36:20.322318 7f88e0bc0700 0 mds.0.cache creating system inode with ino:0x100
2020-10-22 11:36:20.322918 7f88e0bc0700 0 mds.0.cache creating system inode with ino:0x1
2020-10-22 11:36:47.922531 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:36:47.922549 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:36:50.914516 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:36:51.351457 7f88e4bc8700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-10-22 11:37:07.923089 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:07.923126 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 3.99913s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:10.914767 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:11.923216 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:11.923223 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 7.99926s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:15.914831 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:15.923286 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:15.923294 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 11.9994s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:19.923359 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:19.923366 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 15.9995s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:20.914917 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:23.923430 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:23.923437 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 19.9996s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:25.914981 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:27.923501 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:27.923508 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 23.9998s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:30.915046 7f88e83cf700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:31.923572 7f88e43c7700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-10-22 11:37:31.923579 7f88e43c7700 0 mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last acked 27.9999s ago); MDS internal heartbeat is not healthy!
2020-10-22 11:37:32.412628 7f88e4bc8700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-10-22 11:37:32.412635 7f88e4bc8700 1 mds.beacon.hostnamecephssd01 is_laggy 28.4889 > 15 since last acked beacon
2020-10-22 11:37:32.412643 7f88e4bc8700 1 mds.0.90889 skipping upkeep work because connection to Monitors appears laggy
2020-10-22 11:37:32.412657 7f88e73cd700 1 mds.hostnamecephssd01 Updating MDS map to version 90890 from mon.2
2020-10-22 11:37:35.978858 7f88e934c700 0 mds.beacon.hostnamecephssd01 MDS is no longer laggy

Thanks in advance for any assistance you can provide!

David
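
For anyone who wants to quantify the stall in the MDS log above, here is a minimal Python sketch that measures how long each restart attempt spends between "replay_start" and the first "is_laggy" message. The function names (parse_line, replay_to_laggy_seconds) and the regular expression are hypothetical helpers written for this post, not part of Ceph, and they assume the log layout shown above (timestamp, thread id, level, message).

```python
import re
from datetime import datetime

# Assumed layout of a Luminous ceph-mds log line, based only on the log in this post:
#   <date> <time.micros> <thread id> <level> <message>
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+([0-9a-f]+)\s+(-?\d+)\s+(.*)$"
)


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return ts, m.group(4)


def replay_to_laggy_seconds(lines):
    """Seconds between each 'replay_start' and the next 'is_laggy' message."""
    gaps = []
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank lines, prose, or wrapped fragments
        ts, msg = parsed
        if msg.endswith("replay_start"):
            start = ts
        elif "is_laggy" in msg and start is not None:
            gaps.append((ts - start).total_seconds())
            start = None
    return gaps


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python mds_gap.py /path/to/ceph-mds.log
    with open(sys.argv[1]) as fh:
        for gap in replay_to_laggy_seconds(fh):
            print(f"replay_start -> is_laggy: {gap:.3f}s")
```

Run against the two restart attempts above, this should report a gap of a little over 70 seconds for each, which only confirms that the MDS sits in up:replay well past the 15 second heartbeat grace both times; it says nothing about why.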