Re: Rack outage test failing when nodes get integrated again

Hi Steve,

I also observed that setting mon_osd_reporter_subtree_level to anything other than host leads to incorrect behavior.

In our case, I actually observed the opposite. I had mon_osd_reporter_subtree_level=datacenter (we have 3 DCs in the crush tree). After cutting off a single host with ifdown - also a network cut-off, albeit not via firewall rules - I observed that not all OSDs on that host were marked down (nor was the host itself), leading to blocked IO. I didn't wait very long (only a few minutes, less than 5) because it's a production system, and I also didn't find the time to file a tracker issue. I observed this with Mimic, but since you report it for Pacific, I'm pretty sure it affects all versions.
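
For anyone who wants to check or change this setting on their own cluster, something like the following should do it (untested as written here; which value makes sense obviously depends on your crush tree):

    ceph config get mon mon_osd_reporter_subtree_level
    ceph config set mon mon_osd_reporter_subtree_level host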

My guess is that this is not part of the CI testing, at least not in a way that covers network cut-off.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Steve Baker <steve.bakerx1@xxxxxxxxx>
Sent: Thursday, January 11, 2024 8:45 AM
To: ceph-users@xxxxxxx
Subject:  Rack outage test failing when nodes get integrated again

Hi, we're currently testing a ceph (v16.2.14) cluster: 3 mon nodes and 6
osd nodes with 8 nvme ssd osds each, distributed over 3 racks. Daemons are
deployed in containers with cephadm / podman. We have 2 pools on it, one
with 3x replication and min_size=2, and one with EC (k=3, m=3). With 1 mon
node and 2 osd nodes in each rack, the crush rules are configured (for the
3x pool: chooseleaf_firstn rack; for the EC pool: choose_indep 3 rack /
chooseleaf_indep 2 host) so that a full rack can go down while the cluster
stays accessible for client operations. Other options we have set are
mon_osd_down_out_subtree_limit=host, so that in case of a host/rack outage
the cluster does not automatically start to backfill but continues to run
in a degraded state until a human comes to fix it, and
mon_osd_reporter_subtree_level=rack.
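
For reference, the rules and options look roughly like this (paraphrased
from our decompiled crush map and config; rule ids and names are
placeholders rather than the exact definitions):

    rule replicated_rack {
        id 1
        type replicated
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }

    rule ec_rack_host {
        id 2
        type erasure
        step take default
        step choose indep 3 type rack
        step chooseleaf indep 2 type host
        step emit
    }

    ceph config set mon mon_osd_down_out_subtree_limit host
    ceph config set mon mon_osd_reporter_subtree_level rack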

We tested - while under (synthetic test) client load - what happens if we
take a full rack (one mon node and 2 osd nodes) out of the cluster. We did
that by using iptables to block the nodes of the rack from the other nodes
of the cluster (public and cluster network), as well as from the clients.
As expected, the remainder of the cluster continues to run in a degraded
state without starting any backfill or recovery. All client requests get
served while the rack is out.
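
The blocking itself was done with plain iptables rules on each node of
the rack under test, roughly like this (the subnet is a placeholder for
the public/cluster networks of the other racks and of the clients):

    iptables -A INPUT  -s 1.2.3.0/24 -j DROP
    iptables -A OUTPUT -d 1.2.3.0/24 -j DROP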

But then a strange thing happens when we take the rack (1 mon node, 2 osd
nodes) back into the cluster by deleting all firewall rules at once with
iptables -F. Some osds get integrated into the cluster again immediately,
but others remain in state "down" for exactly 10 minutes. The osds that
stay down for those 10 minutes still seem unable to reach other osd nodes
(see the heartbeat_check logs below). After the 10 minutes have passed,
these osds come up as well, but at exactly that moment many PGs get stuck
in the peering state, other osds that were in the cluster the whole time
get slow requests, and the cluster blocks client traffic (I think it's
just the PGs stuck in peering soaking up all the client threads). Then,
exactly 45 minutes after the nodes of the rack were made reachable again
with iptables -F, the situation recovers: peering succeeds and client
load gets handled again.
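
(For anyone reproducing this: the stuck PGs can be inspected with the
usual commands, e.g.

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg 14.212 query    # example pg id, taken from the mon log below

the mon-side health output during that window is included further down.)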

We have repeated this test several times and it's always exactly the same
10 min "down interval" and 45 min of affected client requests. When we
integrate the nodes into the cluster one after another, with a delay of a
few minutes in between (see the sketch below), this does not happen at
all. I wonder what's happening there. It must be some kind of split-brain
situation caused by blocking the nodes with iptables without rebooting
them completely. The 10 min and 45 min intervals I described occur every
time. For the 10 minutes, some osds stay down after the hosts get
integrated again. It's not all 16 osds from the 2 reintegrated osd hosts,
just some of them; which ones varies randomly, and sometimes it's only
one. We also observed that the longer the hosts were out of the cluster,
the more osds are affected. Then, even after they come up again after 10
minutes, it takes another 45 minutes until the stuck-peering situation
resolves. Also during these 45 minutes, we see slow ops on osds that
remained in the cluster.
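
For completeness, the staggered reintegration that does work is
essentially just the following, instead of flushing the rules on all
three nodes at once (host names and the delay are placeholders):

    for host in ceph-mon03 ceph-osd07 ceph-osd08; do
        ssh "$host" iptables -F
        sleep 300
    done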

####################################################
Here are some OSD logs that get written after the reintegration:
####################################################

2024-01-04T08:25:03.856+0000 7f369132b700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2024-01-04T07:25:03.860426+0000)
2024-01-04T08:25:06.556+0000 7f3682882700  0 log_channel(cluster) log [WRN]
: Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25:06.556+0000 7f3682882700  0 log_channel(cluster) log [DBG]
: map e62160 wrongly marked me down at e62136
2024-01-04T08:25:06.556+0000 7f3682882700  1 osd.0 62160
start_waiting_for_healthy

2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6810 osd.2 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6814 osd.3 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6822 osd.4 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6830 osd.5 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6830 osd.7 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6830 osd.9 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6802 osd.10 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6802 osd.11 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6802 osd.13 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6802 osd.15 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6806 osd.16 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6806 osd.17 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6806 osd.20 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6806 osd.21 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6810 osd.22 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6814 osd.23 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6810 osd.26 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6810 osd.27 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6814 osd.28 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6818 osd.29 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6814 osd.32 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6818 osd.33 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6818 osd.34 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6822 osd.35 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6818 osd.37 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6822 osd.39 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6822 osd.40 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6826 osd.41 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6826 osd.43 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6826 osd.44 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6826 osd.46 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6830 osd.47 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)

[The block of log lines above gets repeated for 45 minutes until
everything is fine again. The block below appears once at the beginning,
right after the reintegration, then starts again after the 10 min
interval and repeats until the end of the 45 min interval, when
everything is fine again.]

2024-01-04T08:25:07.036+0000 7f3691b2c700  0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.036+0000 7f3691b2c700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700  0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700  0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700  0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:35:07.225+0000 7f369232d700  0 auth: could not find
secret_id=1365
2024-01-04T08:35:07.225+0000 7f369232d700  0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1365

[The block below gets logged for 10 minutes, until the osd is no longer
down.]

2024-01-04T08:25:08.368+0000 7f368d2a6700  1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:08.368+0000 7f368d2a6700  1 osd.0 62162 not healthy;
waiting to boot
2024-01-04T08:25:09.340+0000 7f368d2a6700  1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:09.340+0000 7f368d2a6700  1 osd.0 62162 not healthy;
waiting to boot
2024-01-04T08:25:10.316+0000 7f368d2a6700  1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:10.316+0000 7f368d2a6700  1 osd.0 62162 not healthy;
waiting to boot

After 10 minutes, the osd then seems to reboot:

2024-01-04T08:35:07.005+0000 7f368d2a6700  1 osd.0 62509 start_boot
2024-01-04T08:35:07.009+0000 7f368b2a2700  1 osd.0 62509 set_numa_affinity
storage numa node 0
2024-01-04T08:35:07.009+0000 7f368b2a2700 -1 osd.0 62509 set_numa_affinity
unable to identify public interface '' numa node: (2) No such file or
directory
2024-01-04T08:35:07.009+0000 7f368b2a2700  1 osd.0 62509 set_numa_affinity
not setting numa affinity
2024-01-04T08:35:07.197+0000 7f367ea40700  2 osd.0 62509 ms_handle_reset
con 0x561d78ec6000 session 0x561d8ae5f0e0
2024-01-04T08:35:07.213+0000 7f3682882700  1 osd.0 62521 state: booting ->
active

##############################################################
Here are some logs from the active mon that get written after the
reintegration:
##############################################################

2024-01-04T08:25:06.486+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62160 send_latest to osd.0 v2:... start 62136
2024-01-04T08:25:06.486+0000 7ff5fa87b700  1 mon.ceph-mon01@0(leader).osd
e62160  ignoring beacon from non-active osd.0
2024-01-04T08:25:06.490+0000 7ff5f9078700  0 log_channel(cluster) log [WRN]
:     osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:25:06.642+0000 7ff5f9078700  0 log_channel(cluster) log [INF]
: osd.0 marked itself dead as of e62160
2024-01-04T08:25:07.434+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62161 preprocess_failure dne(/dup?): osd.0 [v2:...,v1:...], from osd.45
2024-01-04T08:29:59.998+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:29:59.998+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     Slow OSD heartbeats on back from osd.45 [rack3] to osd.0 [rack3]
(down) 221701.273 msec
2024-01-04T08:29:59.998+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     Slow OSD heartbeats on front from osd.45 [rack3] to osd.0 [rack3]
(down) 221700.153 msec
2024-01-04T08:31:43.019+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62434 send_incremental [62416..62434] to osd.0
2024-01-04T08:32:13.915+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62443 send_incremental [62435..62443] to osd.0
2024-01-04T08:32:44.891+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62461 send_incremental [62444..62461] to osd.0
2024-01-04T08:33:15.752+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62465 send_incremental [62462..62465] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700  5 mon.ceph-mon01@0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:51.240+0000 7ff5f9078700  5 mon.ceph-mon01@0(leader).osd
e62484 send_incremental [62484..62484] to osd.0
2024-01-04T08:35:04.077+0000 7ff5fd080700  0 log_channel(cluster) log [INF]
: Marking osd.0 out (has been down for 600 seconds)
2024-01-04T08:35:04.085+0000 7ff5fd080700  2 mon.ceph-mon01@0(leader).osd
e62517  osd.0 OUT
2024-01-04T08:35:07.173+0000 7ff5fd080700  2 mon.ceph-mon01@0(leader).osd
e62520  osd.0 UP [v2:...,v1:...]
2024-01-04T08:35:07.173+0000 7ff5fd080700  2 mon.ceph-mon01@0(leader).osd
e62520  osd.0 IN


This is what gets logged shortly before the situation recovers:

2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health detail: HEALTH_WARN Reduced data availability: 292 pgs inactive,
292 pgs peering; Degraded data redundancy: 14614397/4091962683 objects
degraded (0.357%), 322 pgs degraded, 360 pgs undersized; 2 pools have too
many placement groups; 9472 slow ops, oldest one blocked for 2091 sec,
daemons [osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17>
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: [WRN] PG_AVAILABILITY: Reduced data availability: 292 pgs inactive, 292
pgs peering
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.212 is stuck peering for 34m, current state remapped+peering,
last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.220 is stuck peering for 34m, current state remapped+peering,
last acting [3,35]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.227 is stuck peering for 34m, current state remapped+peering,
last acting [44,26]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.228 is stuck peering for 34m, current state remapped+peering,
last acting [23,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.22b is stuck peering for 34m, current state remapped+peering,
last acting [32,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.234 is stuck peering for 34m, current state remapped+peering,
last acting [32,44]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.24e is stuck peering for 34m, current state remapped+peering,
last acting [17,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.255 is stuck peering for 34m, current state remapped+peering,
last acting [4,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.260 is stuck peering for 34m, current state remapped+peering,
last acting [17,47]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.267 is stuck peering for 34m, current state remapped+peering,
last acting [44,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.27d is stuck peering for 34m, current state remapped+peering,
last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.289 is stuck peering for 34m, current state remapped+peering,
last acting [13,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.292 is stuck peering for 34m, current state remapped+peering,
last acting [15,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.297 is stuck peering for 34m, current state remapped+peering,
last acting [2,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.29d is stuck peering for 34m, current state remapped+peering,
last acting [40,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 14.2a8 is stuck peering for 34m, current state remapped+peering,
last acting [33,4]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.20e is stuck inactive for 34m, current state remapped+peering,
last acting [33,39,2147483647,2147483647,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.212 is stuck peering for 34m, current state remapped+peering,
last acting [36,2147483647,22,40,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.219 is stuck peering for 34m, current state remapped+peering,
last acting [13,10,21,34,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.21c is stuck peering for 34m, current state remapped+peering,
last acting [41,4,2147483647,14,44,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.222 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,23,32,3,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.22d is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,45,41,20,17,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.233 is stuck peering for 34m, current state remapped+peering,
last acting [4,2,27,34,14,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.23a is stuck peering for 34m, current state remapped+peering,
last acting [41,43,19,2147483647,34,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.23b is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,30,7,41,34,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.23e is stuck peering for 34m, current state remapped+peering,
last acting [10,37,2147483647,2147483647,11,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.243 is stuck peering for 34m, current state remapped+peering,
last acting [23,13,11,15,2147483647,45]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.244 is stuck peering for 34m, current state remapped+peering,
last acting [13,35,14,2147483647,17,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.249 is stuck peering for 34m, current state remapped+peering,
last acting [32,47,2147483647,2147483647,46,17]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.24a is stuck peering for 34m, current state remapped+peering,
last acting [47,7,15,5,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.24b is stuck peering for 34m, current state remapped+peering,
last acting [30,2147483647,46,28,4,29]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.24f is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,13,35,33,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.254 is stuck peering for 34m, current state remapped+peering,
last acting [15,39,2147483647,2147483647,4,16]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.257 is stuck peering for 34m, current state remapped+peering,
last acting [13,10,2147483647,2147483647,22,40]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.25d is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,20,16,34,46]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.26f is stuck peering for 34m, current state remapped+peering,
last acting [33,17,2147483647,14,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.273 is stuck peering for 34m, current state remapped+peering,
last acting [36,2147483647,29,4,17,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.27a is stuck peering for 34m, current state remapped+peering,
last acting [40,34,2147483647,2147483647,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.27b is stuck peering for 34m, current state remapped+peering,
last acting [41,37,2147483647,19,11,3]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.27e is stuck peering for 34m, current state remapped+peering,
last acting [44,9,4,41,45,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.281 is stuck peering for 34m, current state remapped+peering,
last acting [2,43,21,17,31,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.290 is stuck peering for 34m, current state remapped+peering,
last acting [17,21,2147483647,2147483647,23,37]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.296 is stuck peering for 34m, current state remapped+peering,
last acting [44,3,35,20,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.298 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,36,43,23,44,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2a5 is stuck peering for 34m, current state remapped+peering,
last acting [33,11,19,2147483647,43,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2a6 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,4,29,27,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2a8 is stuck peering for 34m, current state remapped+peering,
last acting [47,4,2147483647,2147483647,15,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2ae is stuck peering for 34m, current state remapped+peering,
last acting [23,32,2147483647,2147483647,3,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2b4 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,23,20,15,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2b7 is stuck peering for 34m, current state remapped+peering,
last acting [17,3,2147483647,45,41,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
:     pg 15.2b8 is stuck peering for 34m, current state remapped+peering,
last acting [40,44,2147483647,2147483647,43,10]


2024-01-04T09:10:02.572+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: Degraded data redundancy: 14296421/4091962644
objects degraded (0.349%), 322 pgs degraded, 360 pgs undersized
(PG_DEGRADED)
2024-01-04T09:10:02.572+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: 9127 slow ops, oldest one blocked for 2096 sec,
daemons
[osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17,osd.2,osd.20,osd.21]...
have slow ops. (SLOW_OPS)


2024-01-04T09:10:27.608+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: Slow OSD heartbeats on back (longest 429402.973ms)
(OSD_SLOW_PING_TIME_BACK)
2024-01-04T09:10:27.608+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: Slow OSD heartbeats on front (longest 429531.265ms)
(OSD_SLOW_PING_TIME_FRONT)
2024-01-04T09:10:27.608+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: Degraded data redundancy: 96918631/4091964189
objects degraded (2.369%), 404 pgs degraded, 241 pgs undersized
(PG_DEGRADED)
2024-01-04T09:10:27.608+0000 7ff5fd080700  0 log_channel(cluster) log [WRN]
: Health check update: 706 slow ops, oldest one blocked for 2121 sec,
daemons
[osd.10,osd.11,osd.12,osd.13,osd.15,osd.16,osd.17,osd.18,osd.2,osd.20]...
have slow ops. (SLOW_OPS)


Any ideas what's going on here?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx