Hi,
just add the host back to the cluster with 'ceph orch host add ...'.
If the host still has the cephadm pub key, the orchestrator will
redeploy the missing mon daemon (depending on your actual mon spec)
and the other missing services.
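Something like this should do (hostname and IP taken from your output
below; I'm assuming you connect as root, adjust the user otherwise,
and the first two steps are only needed if the host lost the key):

  ceph cephadm get-pub-key > ~/ceph.pub
  ssh-copy-id -f -i ~/ceph.pub root@ceph-test-4
  ceph orch host add ceph-test-4 10.0.0.54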
In the current state there won't be any recovery, because you only
have two hosts left while your crush rule requires three. There are
ways to recover anyway, for example by editing the rule or by reducing
the pool size to 2, but I would only do that in a test cluster.
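If you really wanted to try that in a test cluster, it would look like
this per pool (not something I'd do in production):

  ceph osd pool set <pool> size 2
  ceph osd pool set <pool> min_size 1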
Depending on what the host removal actually removed, you might be able
to just reintegrate the existing OSDs ('ceph cephadm osd activate
<host>') and recovery will kick in. In case the OSD keyrings are gone,
you can import them again; they should still be present on the removed
OSD host.
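For example for osd.7 (the fsid is from your output below; with the
default cephadm layout the keyring on the removed host usually sits in
/var/lib/ceph/fca870d8-e431-11ef-8000-bc2411363b7d/osd.7/keyring, copy
it over first; the caps should match your other OSDs, compare with
'ceph auth get osd.0'):

  ceph auth add osd.7 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' -i osd.7.keyring

and once the host is back in the orchestrator:

  ceph cephadm osd activate ceph-test-4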
Regards,
Eugen
Quoting Kirby Haze <kirbyhaze01@xxxxxxxxx>:
I have a 3-node 17.2.7 test cluster, and I decided to power down one of
the hosts, which contained a mon, 6 OSDs and a standby mgr. Then I used
`ceph orch host rm <host> --offline --force` to remove the host (after
powering it down).
All of this looks expected except for the logs after removing the host.
The pools have size 3 and min_size 2. I'm actually unsure what would
happen if I did successfully add the host back, since those objects
would be degraded. Besides the error from cephadm, what was Ceph
thinking when I pulled the plug and then removed that host? Am I
guaranteed some kind of recovery if I do successfully add that host
back?
----
Before removing host
root@ceph-test-2:/# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-test-2,ceph-test-3,ceph-test-4 (age 29m)
    mgr: ceph-test-2.vbjhdq(active, since 31m), standbys: ceph-test-4.jjubsa
    osd: 18 osds: 18 up (since 22m), 18 in (since 28m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   5.2 GiB used, 8.8 TiB / 8.8 TiB avail
    pgs:     129 active+clean
root@ceph-test-2:/# ceph orch host ls
HOST         ADDR       LABELS      STATUS
ceph-test-2  10.0.0.52  _admin,rgw
ceph-test-3  10.0.0.53
ceph-test-4  10.0.0.54
-------------
After removing host
# ceph orch host rm ceph-test-4 --offline --force
Removed offline host 'ceph-test-4'
# ceph orch ps
mon.ceph-test-4  ceph-test-4  stopped
osd.1            ceph-test-4  error
osd.7            ceph-test-4  error
....
....
# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         5.85956  root default
-5         2.92978      host ceph-test-2
 2    hdd  0.48830          osd.2             up   1.00000  1.00000
 4    hdd  0.48830          osd.4             up   1.00000  1.00000
 8    hdd  0.48830          osd.8             up   1.00000  1.00000
11    hdd  0.48830          osd.11            up   1.00000  1.00000
14    hdd  0.48830          osd.14            up   1.00000  1.00000
16    hdd  0.48830          osd.16            up   1.00000  1.00000
-3         2.92978      host ceph-test-3
 0    hdd  0.48830          osd.0             up   1.00000  1.00000
 3    hdd  0.48830          osd.3             up   1.00000  1.00000
 6    hdd  0.48830          osd.6             up   1.00000  1.00000
 9    hdd  0.48830          osd.9             up   1.00000  1.00000
12    hdd  0.48830          osd.12            up   1.00000  1.00000
15    hdd  0.48830          osd.15            up   1.00000  1.00000
root@ceph-test-2:/rootfs/root# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            Degraded data redundancy: 145/669 objects degraded (21.674%), 24 pgs degraded, 71 pgs undersized

  services:
    mon: 2 daemons, quorum ceph-test-2,ceph-test-3 (age 50m)
    mgr: ceph-test-2.vbjhdq(active, since 2h), standbys: ceph-test-3.wzmioq
    osd: 12 osds: 12 up (since 51m), 12 in (since 2h); 58 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   3.5 GiB used, 5.9 TiB / 5.9 TiB avail
    pgs:     145/669 objects degraded (21.674%)
             75/669 objects misplaced (11.211%)
             54 active+clean+remapped
             47 active+undersized
             24 active+undersized+degraded
             4  active+clean

  progress:
    Global Recovery Event (50m)
      [================............] (remaining: 37m)
The cephadm logs show this as well:
2025-02-06T06:27:46.936+0000 7f85026a4700 -1 log_channel(cephadm) log [ERR] : auth get failed: failed to find osd.7 in keyring retval: -2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2136, in _daemon_action
    daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 494, in generate_config
    extra_ceph_config=daemon_spec.ceph_conf)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 520, in get_config_and_keyring
    'entity': entity,
  File "/usr/share/ceph/mgr/mgr_module.py", line 1593, in check_mon_command
    raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr} retval: {r.retval}')
mgr_module.MonCommandFailed: auth get failed: failed to find osd.7 in keyring retval: -2
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx