Hi Steve,

I also observed that setting mon_osd_reporter_subtree_level to anything other than host leads to incorrect behavior. In our case I actually observed the opposite. I had mon_osd_reporter_subtree_level=datacenter (we have 3 DCs in the crush tree). After cutting off a single host with ifdown - also a network cut-off, albeit not via firewall rules - I observed that not all OSDs on that host were marked down (neither was the host), leading to blocked IO. I didn't wait for very long (only a few minutes, less than 5), because it's a production system. I also didn't find the time to file a tracker issue.

I observed this with Mimic, but since you report it for Pacific I'm pretty sure it affects all versions. My guess is that this is not part of the CI testing, at least not in a way that covers network cut-offs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Steve Baker <steve.bakerx1@xxxxxxxxx>
Sent: Thursday, January 11, 2024 8:45 AM
To: ceph-users@xxxxxxx
Subject: Rack outage test failing when nodes get integrated again

Hi,

we're currently testing a Ceph (v16.2.14) cluster: 3 mon nodes and 6 OSD nodes with 8 NVMe SSD OSDs each, distributed over 3 racks. Daemons are deployed in containers with cephadm / podman. We have 2 pools on it, one with 3x replication and min_size=2, and one erasure coded with k=3, m=3. With 1 mon node and 2 OSD nodes in each rack, the crush rules are configured (for the 3x pool: chooseleaf firstn rack; for the EC pool: choose indep 3 rack / chooseleaf indep 2 host) so that a full rack can go down while the cluster stays accessible for client operations. Other options we have set are mon_osd_down_out_subtree_limit=host, so that in case of a host/rack outage the cluster does not automatically start to backfill but keeps running in a degraded state until a human comes to fix it, and mon_osd_reporter_subtree_level=rack.

We tested - while under (synthetic test) client load - what happens when we take a full rack (one mon node and 2 OSD nodes) out of the cluster. We did that using iptables to block the nodes of the rack from the other nodes of the cluster (public and cluster network), as well as from the clients. As expected, the remainder of the cluster continues to run in a degraded state without starting any backfill or recovery, and all client requests get served while the rack is out.

But then a strange thing happens when we take the rack (1 mon node, 2 OSD nodes) back into the cluster by deleting all firewall rules with iptables -F at once. Some OSDs get integrated into the cluster again immediately, but others remain in state "down" for exactly 10 minutes. The OSDs that stay down for those 10 minutes still seem unable to reach other OSD nodes (see the heartbeat_check logs below). After these 10 minutes have passed, these OSDs come up as well, but at exactly that moment many PGs get stuck in the peering state, other OSDs that were in the cluster the whole time get slow requests, and the cluster blocks client traffic (I think it's just the PGs stuck in peering soaking up all the client threads). Then, exactly 45 minutes after the nodes of the rack were made reachable again with iptables -F, the situation recovers, peering succeeds and client load gets handled again. We have repeated this test several times and it's always exactly the same 10 min "down" interval and 45 min of affected client requests.
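For reference, the relevant pieces look roughly like the sketch below; the rule names, rule ids and the address placeholder are illustrative examples, not verbatim copies from our cluster:

    # mon options
    ceph config set mon mon_osd_down_out_subtree_limit host
    ceph config set mon mon_osd_reporter_subtree_level rack

    # replicated pool rule (decompiled crush map, abridged; name/id are examples)
    rule replicated_rack {
        id 1
        type replicated
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }

    # EC pool rule (abridged; name/id are examples)
    rule ec_rack_host {
        id 2
        type erasure
        step take default
        step choose indep 3 type rack
        step chooseleaf indep 2 type host
        step emit
    }

    # rack cut-off for the test: on every node of the rack, drop traffic
    # to/from the other cluster nodes (public and cluster network) and the clients
    iptables -A INPUT -s <other-node-or-client-ip> -j DROP
    iptables -A OUTPUT -d <other-node-or-client-ip> -j DROP

    # reintegration, all nodes of the rack at once
    iptables -F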
When we integrate the nodes into the cluster again one after another, with a delay of a few minutes in between, this does not happen at all. I wonder what's happening there. It must be some kind of split-brain situation caused by blocking the nodes with iptables rather than rebooting them completely. The 10 min and 45 min intervals I described occur every time. For the 10 minutes, some OSDs stay down after the hosts get integrated again. It's not all of the 16 OSDs from the 2 reintegrated OSD hosts, just some of them, and which ones varies randomly; sometimes it's only one. We also observed that the longer the hosts were out of the cluster, the more OSDs are affected. Even after they come up again after 10 minutes, it takes another 45 minutes until the stuck-peering situation resolves, and during these 45 minutes we also see slow ops on OSDs that remained in the cluster.

####################################################
See here some OSD logs that get written after the reintegration:
####################################################

2024-01-04T08:25:03.856+0000 7f369132b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-01-04T07:25:03.860426+0000)
2024-01-04T08:25:06.556+0000 7f3682882700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25:06.556+0000 7f3682882700 0 log_channel(cluster) log [DBG] : map e62160 wrongly marked me down at e62136
2024-01-04T08:25:06.556+0000 7f3682882700 1 osd.0 62160 start_waiting_for_healthy
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6810 osd.2 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6814 osd.3 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6822 osd.4 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6830 osd.5 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6830 osd.7 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6830 osd.9 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6802 osd.10 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6802 osd.11 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6802 osd.13 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6802 osd.15 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6806 osd.16 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6806 osd.17 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6806 osd.20 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6806 osd.21 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6810 osd.22 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6814 osd.23 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6810 osd.26 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6810 osd.27 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6814 osd.28 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6818 osd.29 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6814 osd.32 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6818 osd.33 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6818 osd.34 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6822 osd.35 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6818 osd.37 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6822 osd.39 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6822 osd.40 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6826 osd.41 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.6:6826 osd.43 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.9:6826 osd.44 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.8:6826 osd.46 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check: no reply from 1.2.3.5:6830 osd.47 ever on either front or back, first ping sent 2024-01-04T08:21:04.601131+0000 (oldest deadline 2024-01-04T08:21:24.601131+0000)

[The block of log lines above gets repeated for 45 minutes until everything is fine again; the block below comes once at the beginning after the reintegration, then starts again after the 10 min interval and then repeats until the end of the 45 min interval when everything is fine again.]
2024-01-04T08:25:07.036+0000 7f3691b2c700 0 auth: could not find secret_id=1363
2024-01-04T08:25:07.036+0000 7f3691b2c700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700 0 auth: could not find secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 auth: could not find secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 auth: could not find secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=1363
2024-01-04T08:35:07.225+0000 7f369232d700 0 auth: could not find secret_id=1365
2024-01-04T08:35:07.225+0000 7f369232d700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=1365

[The block below gets logged for 10 minutes until the osd is not down anymore]

2024-01-04T08:25:08.368+0000 7f368d2a6700 1 osd.0 62162 is_healthy false -- only 0/10 up peers (less than 33%)
2024-01-04T08:25:08.368+0000 7f368d2a6700 1 osd.0 62162 not healthy; waiting to boot
2024-01-04T08:25:09.340+0000 7f368d2a6700 1 osd.0 62162 is_healthy false -- only 0/10 up peers (less than 33%)
2024-01-04T08:25:09.340+0000 7f368d2a6700 1 osd.0 62162 not healthy; waiting to boot
2024-01-04T08:25:10.316+0000 7f368d2a6700 1 osd.0 62162 is_healthy false -- only 0/10 up peers (less than 33%)
2024-01-04T08:25:10.316+0000 7f368d2a6700 1 osd.0 62162 not healthy; waiting to boot

After 10 minutes then, the osd seems to reboot:

2024-01-04T08:35:07.005+0000 7f368d2a6700 1 osd.0 62509 start_boot
2024-01-04T08:35:07.009+0000 7f368b2a2700 1 osd.0 62509 set_numa_affinity storage numa node 0
2024-01-04T08:35:07.009+0000 7f368b2a2700 -1 osd.0 62509 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2024-01-04T08:35:07.009+0000 7f368b2a2700 1 osd.0 62509 set_numa_affinity not setting numa affinity
2024-01-04T08:35:07.197+0000 7f367ea40700 2 osd.0 62509 ms_handle_reset con 0x561d78ec6000 session 0x561d8ae5f0e0
2024-01-04T08:35:07.213+0000 7f3682882700 1 osd.0 62521 state: booting -> active

##############################################################
See here some logs of the active mon that get written after the reintegration:
##############################################################

2024-01-04T08:25:06.486+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62160 send_latest to osd.0 v2:... start 62136
2024-01-04T08:25:06.486+0000 7ff5fa87b700 1 mon.ceph-mon01@0(leader).osd e62160 ignoring beacon from non-active osd.0
2024-01-04T08:25:06.490+0000 7ff5f9078700 0 log_channel(cluster) log [WRN] : osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:25:06.642+0000 7ff5f9078700 0 log_channel(cluster) log [INF] : osd.0 marked itself dead as of e62160
2024-01-04T08:25:07.434+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62161 preprocess_failure dne(/dup?): osd.0 [v2:...,v1:...], from osd.45
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Slow OSD heartbeats on back from osd.45 [rack3] to osd.0 [rack3] (down) 221701.273 msec
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Slow OSD heartbeats on front from osd.45 [rack3] to osd.0 [rack3] (down) 221700.153 msec
2024-01-04T08:31:43.019+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62434 send_incremental [62416..62434] to osd.0
2024-01-04T08:32:13.915+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62443 send_incremental [62435..62443] to osd.0
2024-01-04T08:32:44.891+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62461 send_incremental [62444..62461] to osd.0
2024-01-04T08:33:15.752+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62465 send_incremental [62462..62465] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01@0(leader).osd e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:51.240+0000 7ff5f9078700 5 mon.ceph-mon01@0(leader).osd e62484 send_incremental [62484..62484] to osd.0
2024-01-04T08:35:04.077+0000 7ff5fd080700 0 log_channel(cluster) log [INF] : Marking osd.0 out (has been down for 600 seconds)
2024-01-04T08:35:04.085+0000 7ff5fd080700 2 mon.ceph-mon01@0(leader).osd e62517 osd.0 OUT
2024-01-04T08:35:07.173+0000 7ff5fd080700 2 mon.ceph-mon01@0(leader).osd e62520 osd.0 UP [v2:...,v1:...]
2024-01-04T08:35:07.173+0000 7ff5fd080700 2 mon.ceph-mon01@0(leader).osd e62520 osd.0 IN

This is what gets logged shortly before the situation recovers:

2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN Reduced data availability: 292 pgs inactive, 292 pgs peering; Degraded data redundancy: 14614397/4091962683 objects degraded (0.357%), 322 pgs degraded, 360 pgs undersized; 2 pools have too many placement groups; 9472 slow ops, oldest one blocked for 2091 sec, daemons [osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17>
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : [WRN] PG_AVAILABILITY: Reduced data availability: 292 pgs inactive, 292 pgs peering
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.212 is stuck peering for 34m, current state remapped+peering, last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.220 is stuck peering for 34m, current state remapped+peering, last acting [3,35]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.227 is stuck peering for 34m, current state remapped+peering, last acting [44,26]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.228 is stuck peering for 34m, current state remapped+peering, last acting [23,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.22b is stuck peering for 34m, current state remapped+peering, last acting [32,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.234 is stuck peering for 34m, current state remapped+peering, last acting [32,44]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.24e is stuck peering for 34m, current state remapped+peering, last acting [17,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.255 is stuck peering for 34m, current state remapped+peering, last acting [4,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.260 is stuck peering for 34m, current state remapped+peering, last acting [17,47]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.267 is stuck peering for 34m, current state remapped+peering, last acting [44,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.27d is stuck peering for 34m, current state remapped+peering, last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.289 is stuck peering for 34m, current state remapped+peering, last acting [13,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.292 is stuck peering for 34m, current state remapped+peering, last acting [15,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.297 is stuck peering for 34m, current state remapped+peering, last acting [2,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.29d is stuck peering for 34m, current state remapped+peering, last acting [40,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 14.2a8 is stuck peering for 34m, current state remapped+peering, last acting [33,4]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.20e is stuck inactive for 34m, current state remapped+peering, last acting [33,39,2147483647,2147483647,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.212 is stuck peering for 34m, current state remapped+peering, last acting [36,2147483647,22,40,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.219 is stuck peering for 34m, current state remapped+peering, last acting [13,10,21,34,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.21c is stuck peering for 34m, current state remapped+peering, last acting [41,4,2147483647,14,44,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.222 is stuck peering for 34m, current state remapped+peering, last acting [2147483647,2147483647,23,32,3,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.22d is stuck peering for 34m, current state remapped+peering, last acting [2147483647,45,41,20,17,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.233 is stuck peering for 34m, current state remapped+peering, last acting [4,2,27,34,14,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.23a is stuck peering for 34m, current state remapped+peering, last acting [41,43,19,2147483647,34,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.23b is stuck peering for 34m, current state remapped+peering, last acting [2147483647,30,7,41,34,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.23e is stuck peering for 34m, current state remapped+peering, last acting [10,37,2147483647,2147483647,11,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.243 is stuck peering for 34m, current state remapped+peering, last acting [23,13,11,15,2147483647,45]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.244 is stuck peering for 34m, current state remapped+peering, last acting [13,35,14,2147483647,17,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.249 is stuck peering for 34m, current state remapped+peering, last acting [32,47,2147483647,2147483647,46,17]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.24a is stuck peering for 34m, current state remapped+peering, last acting [47,7,15,5,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.24b is stuck peering for 34m, current state remapped+peering, last acting [30,2147483647,46,28,4,29]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.24f is stuck peering for 34m, current state remapped+peering, last acting [2147483647,2147483647,13,35,33,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.254 is stuck peering for 34m, current state remapped+peering, last acting [15,39,2147483647,2147483647,4,16]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.257 is stuck peering for 34m, current state remapped+peering, last acting [13,10,2147483647,2147483647,22,40]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.25d is stuck peering for 34m, current state remapped+peering, last acting [2147483647,2147483647,20,16,34,46]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.26f is stuck peering for 34m, current state remapped+peering, last acting [33,17,2147483647,14,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.273 is stuck peering for 34m, current state remapped+peering, last acting [36,2147483647,29,4,17,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.27a is stuck peering for 34m, current state remapped+peering, last acting [40,34,2147483647,2147483647,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.27b is stuck peering for 34m, current state remapped+peering, last acting [41,37,2147483647,19,11,3]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.27e is stuck peering for 34m, current state remapped+peering, last acting [44,9,4,41,45,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.281 is stuck peering for 34m, current state remapped+peering, last acting [2,43,21,17,31,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.290 is stuck peering for 34m, current state remapped+peering, last acting [17,21,2147483647,2147483647,23,37]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.296 is stuck peering for 34m, current state remapped+peering, last acting [44,3,35,20,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.298 is stuck peering for 34m, current state remapped+peering, last acting [2147483647,36,43,23,44,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2a5 is stuck peering for 34m, current state remapped+peering, last acting [33,11,19,2147483647,43,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2a6 is stuck peering for 34m, current state remapped+peering, last acting [2147483647,2147483647,4,29,27,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2a8 is stuck peering for 34m, current state remapped+peering, last acting [47,4,2147483647,2147483647,15,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2ae is stuck peering for 34m, current state remapped+peering, last acting [23,32,2147483647,2147483647,3,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2b4 is stuck peering for 34m, current state remapped+peering, last acting [2147483647,2147483647,23,20,15,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2b7 is stuck peering for 34m, current state remapped+peering, last acting [17,3,2147483647,45,41,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : pg 15.2b8 is stuck peering for 34m, current state remapped+peering, last acting [40,44,2147483647,2147483647,43,10]
2024-01-04T09:10:02.572+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 14296421/4091962644 objects degraded (0.349%), 322 pgs degraded, 360 pgs undersized (PG_DEGRADED)
2024-01-04T09:10:02.572+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: 9127 slow ops, oldest one blocked for 2096 sec, daemons [osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17,osd.2,osd.20,osd.21]... have slow ops. (SLOW_OPS)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: Slow OSD heartbeats on back (longest 429402.973ms) (OSD_SLOW_PING_TIME_BACK)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: Slow OSD heartbeats on front (longest 429531.265ms) (OSD_SLOW_PING_TIME_FRONT)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 96918631/4091964189 objects degraded (2.369%), 404 pgs degraded, 241 pgs undersized (PG_DEGRADED)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN] : Health check update: 706 slow ops, oldest one blocked for 2121 sec, daemons [osd.10,osd.11,osd.12,osd.13,osd.15,osd.16,osd.17,osd.18,osd.2,osd.20]... have slow ops. (SLOW_OPS)

Any ideas what's going on here?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx