Release: 16.2.7 (pacific)
Infra: 4 x nodes (4 x OSD HDD), 3 x nodes (mon/mds, 1 x OSD NVMe)

We recently had a couple of nodes go offline unexpectedly, triggering a rebalance which is still ongoing. The OSDs on the restarted nodes are marked down and keep logging `authenticate timed out`; after a period of time they get automatically marked `out`. We set `noout` on the cluster, which has stopped them being marked out, but they still never authenticate. We can access all the Ceph tooling from those nodes, which suggests they can reach the mons. The node keyrings and clocks are both in sync. We are at a loss as to why we cannot get the OSDs to authenticate. Any help would be appreciated.

```
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            7 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            insufficient standby MDS daemons available
            nodown,noout flag(s) set
            8 osds down
            2 hosts (8 osds) down
            Degraded data redundancy: 195930251/392039621 objects degraded (49.977%), 160 pgs degraded, 160 pgs undersized
            2 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 38h)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 11 up (since 38h), 19 in (since 45h); 5 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.94M objects, 67 TiB
    usage:   68 TiB used, 50 TiB / 118 TiB avail
    pgs:     195930251/392039621 objects degraded (49.977%)
             3205811/392039621 objects misplaced (0.818%)
             155 active+undersized+degraded
             97  active+clean
             3   active+undersized+degraded+remapped+backfill_wait
             2   active+undersized+degraded+remapped+backfilling

  io:
    client:   511 B/s rd, 102 KiB/s wr, 0 op/s rd, 2 op/s wr
    recovery: 13 MiB/s, 16 objects/s
```
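For reference, these are roughly the checks we ran on one of the affected nodes (osd.12 is just an example ID, not one of our actual OSDs; the fsid matches the cluster id above; this is a cephadm deployment, so daemon data lives under /var/lib/ceph/<fsid>/):

```
# Stop the cluster from auto-marking the downed OSDs out
ceph osd set noout

# Compare the OSD's local keyring with the key registered in the mon auth database
cat /var/lib/ceph/d5126e5a-882e-11ec-954e-90e2baec3d2c/osd.12/keyring
ceph auth get osd.12

# Confirm the node clock is in sync
chronyc tracking

# Follow the OSD daemon log, where the authenticate timeout appears
journalctl -u ceph-d5126e5a-882e-11ec-954e-90e2baec3d2c@osd.12 -f
```

The keyrings match, the clocks track within a few milliseconds, and the `ceph` CLI itself works from these nodes, yet the OSDs still time out on authenticate.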