I moved some servers to a new rack, and now that things are back up I'm having major issues with Ceph. I believe the problem may be related to the Ceph nodes coming back up with different IPs before the VLANs were configured. That's just a guess, because I can't think of any other reason this would happen.

Current state:

Every 2.0s: ceph -s                    cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022

  cluster:
    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 MDSs report slow metadata IOs
            2/5 mons down, quorum cn02,cn03,cn01
            9 osds down
            3 hosts (17 osds) down
            Reduced data availability: 97 pgs inactive, 9 pgs down
            Degraded data redundancy: 13860144/30824413 objects degraded (44.965%), 411 pgs degraded, 482 pgs undersized

  services:
    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
    mgr: cn02.arszct(active, since 5m)
    mds: 2/2 daemons up, 2 standby
    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   8 pools, 545 pgs
    objects: 7.71M objects, 6.7 TiB
    usage:   15 TiB used, 39 TiB / 54 TiB avail
    pgs:     0.367% pgs unknown
             17.431% pgs not active
             13860144/30824413 objects degraded (44.965%)
             1137693/30824413 objects misplaced (3.691%)
             280 active+undersized+degraded
              67 undersized+degraded+remapped+backfilling+peered
              57 active+undersized+remapped
              45 active+clean+remapped
              44 active+undersized+degraded+remapped+backfilling
              18 undersized+degraded+peered
              10 active+undersized
               9 down
               7 active+clean
               3 active+undersized+remapped+backfilling
               2 active+undersized+degraded+remapped+backfill_wait
               2 unknown
               1 undersized+peered

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 168 MiB/s, 158 keys/s, 166 objects/s

I have to disable and re-enable the dashboard just to use it; it gets bogged down after a few moments. Ceph has marked the three servers that were moved to the new rack as "down", but if I run a cephadm host-check, every host seems to pass:

************************ ceph ************************
--------- cn01.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn02.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn03.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn04.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn05.ceph.---------
podman|docker (/usr/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn06.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK

The cluster seems to be recovering with what it has left, but a large number of OSDs are down.
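Since the address change is my only theory, the next thing I'm planning to do is compare the addresses the cluster expects with what the moved hosts actually have now. Nothing exotic, just the standard commands from a cephadm shell on one of the mons still in quorum:

  # addresses the monmap still expects for each mon, including cn04/cn05
  ceph mon dump

  # the public network the daemons are supposed to bind to
  ceph config get mon public_network

  # the address cephadm has on record for each host
  ceph orch host ls

  # and, on each moved host, the addresses it actually has right now
  ip -4 addr show

If the monmap still lists the old addresses for cn04/cn05, I assume that would explain why those two are out of quorum, but I'd welcome a sanity check on that.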
When I try to restart one of the downed OSDs, I see a huge dump:

Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with init, starting boot process
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot
Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot

At this point it just keeps printing start_boot, yet the dashboard shows the OSD as "in" but "down". On the three hosts that moved, a bunch of OSDs are marked "out" and "down", and some "in" but "down".

I'm not sure where to go from here. For now I'm letting the recovery continue and hoping that the 4x replication on these pools saves me. Any help is very much appreciated; this Ceph cluster holds all of our CloudStack images, and it would be terrible to lose that data.
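In case it helps with diagnosis, this is what I'm planning to look at next for the stuck OSDs, using osd.34 from the log above as the example. Again, just the standard tooling, nothing clever:

  # which OSDs and hosts the cluster still considers down
  ceph osd tree down

  # the addresses osd.34 last registered with the mons (these may be stale from before the move)
  ceph osd metadata 34 | grep addr

  # the networks the OSDs are expected to use
  ceph config get osd public_network
  ceph config get osd cluster_network

  # the full daemon log on cn06 itself, beyond the snippet above
  cephadm logs --name osd.34

My working assumption is that an OSD looping on start_boot isn't getting its boot acknowledged by the mons, which would fit the address/VLAN theory, but I'd appreciate confirmation from anyone who has seen this before.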