I have a 3-node test 17.2.7 cluster, and I decided to power down one of the hosts, which contained a mon, 6 OSDs, and a standby mgr. Then I used `ceph orch host rm <host> --offline --force` to remove the host (after powering it down). All of this looks as expected except the logs after removing the host. These pools have size 3 and min_size 2. Actually, I'm unsure what would happen even if I successfully added the host back, as those objects would be degraded. Besides the error from cephadm, what is Ceph thinking when I pulled the plug and then removed that host? Am I guaranteed some type of recovery if I do successfully add that host back?

---- Before removing host

root@ceph-test-2:/# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-test-2,ceph-test-3,ceph-test-4 (age 29m)
    mgr: ceph-test-2.vbjhdq(active, since 31m), standbys: ceph-test-4.jjubsa
    osd: 18 osds: 18 up (since 22m), 18 in (since 28m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   5.2 GiB used, 8.8 TiB / 8.8 TiB avail
    pgs:     129 active+clean

root@ceph-test-2:/# ceph orch host ls
HOST         ADDR       LABELS      STATUS
ceph-test-2  10.0.0.52  _admin,rgw
ceph-test-3  10.0.0.53
ceph-test-4  10.0.0.54

------------- After removing host

# ceph orch host rm ceph-test-4 --offline --force
Removed offline host 'ceph-test-4'

# ceph orch ps
mon.ceph-test-4  ceph-test-4  stopped
osd.1            ceph-test-4  error
osd.7            ceph-test-4  error
....
....

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         5.85956  root default
-5         2.92978      host ceph-test-2
 2    hdd  0.48830          osd.2             up   1.00000  1.00000
 4    hdd  0.48830          osd.4             up   1.00000  1.00000
 8    hdd  0.48830          osd.8             up   1.00000  1.00000
11    hdd  0.48830          osd.11            up   1.00000  1.00000
14    hdd  0.48830          osd.14            up   1.00000  1.00000
16    hdd  0.48830          osd.16            up   1.00000  1.00000
-3         2.92978      host ceph-test-3
 0    hdd  0.48830          osd.0             up   1.00000  1.00000
 3    hdd  0.48830          osd.3             up   1.00000  1.00000
 6    hdd  0.48830          osd.6             up   1.00000  1.00000
 9    hdd  0.48830          osd.9             up   1.00000  1.00000
12    hdd  0.48830          osd.12            up   1.00000  1.00000
15    hdd  0.48830          osd.15            up   1.00000  1.00000

root@ceph-test-2:/rootfs/root# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            Degraded data redundancy: 145/669 objects degraded (21.674%), 24 pgs degraded, 71 pgs undersized

  services:
    mon: 2 daemons, quorum ceph-test-2,ceph-test-3 (age 50m)
    mgr: ceph-test-2.vbjhdq(active, since 2h), standbys: ceph-test-3.wzmioq
    osd: 12 osds: 12 up (since 51m), 12 in (since 2h); 58 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   3.5 GiB used, 5.9 TiB / 5.9 TiB avail
    pgs:     145/669 objects degraded (21.674%)
             75/669 objects misplaced (11.211%)
             54 active+clean+remapped
             47 active+undersized
             24 active+undersized+degraded
             4  active+clean

  progress:
    Global Recovery Event (50m)
      [================............] (remaining: 37m)
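In case it helps frame the question, this is roughly what I was planning to run when powering ceph-test-4 back on and re-adding it (hostname and address taken from the `ceph orch host ls` output above). I'm assuming `ceph orch host add` with the original address is enough for cephadm to adopt the host again; whether the OSDs whose records were purged can come back is exactly what I'm unsure about, so treat this as a sketch, not a tested procedure:

# power the node back on first, then re-register it with the orchestrator
ceph orch host add ceph-test-4 10.0.0.54

# then watch what cephadm deploys there and whether the PGs start recovering
ceph orch ps
ceph osd tree
ceph -s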
The cephadm logs show this as well:

2025-02-06T06:27:46.936+0000 7f85026a4700 -1 log_channel(cephadm) log [ERR] : auth get failed: failed to find osd.7 in keyring retval: -2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2136, in _daemon_action
    daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 494, in generate_config
    extra_ceph_config=daemon_spec.ceph_conf)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 520, in get_config_and_keyring
    'entity': entity,
  File "/usr/share/ceph/mgr/mgr_module.py", line 1593, in check_mon_command
    raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr} retval: {r.retval}')
mgr_module.MonCommandFailed: auth get failed: failed to find osd.7 in keyring retval: -2
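For the record, this is how I've been poking at the daemon records the traceback refers to. My assumption is that the orchestrator is still trying to reconfigure OSD daemons whose auth keys were purged along with the host, and whether removing those stale records with `ceph orch daemon rm` is the right cleanup (or even works for a host that's already gone) is part of what I'm asking:

# check whether the key the mgr is looking for still exists (the traceback says it does not)
ceph auth get osd.7

# list the daemons cephadm still tracks, looking for the ones stuck in error state
ceph orch ps

# if those records are just stale, remove them from the orchestrator
ceph orch daemon rm osd.7 --force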