Best way to add back a host after removing offline - cephadm

I have a 3-node test 17.2.7 cluster, and I decided to power down one of the
hosts, which contained a mon, 6 OSDs and a standby mgr. Then I used
`ceph orch host rm <host> --offline --force` to remove the host (after
powering it down).

All of this looks as expected except for the cephadm logs after removing the
host. These pools have size 3 and min_size 2. Actually, I'm unsure what would
happen even if I did successfully add the host back, since those objects are
degraded. Besides the error from cephadm, what is Ceph doing when I pull the
plug and then remove that host? Am I guaranteed some kind of recovery if I do
successfully add that host back?
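
For context, this is roughly the re-add sequence I had in mind (just a
sketch, assuming the host keeps its hostname and address and only needs the
cluster's SSH key re-authorized):

# ceph cephadm get-pub-key > ~/ceph.pub
# ssh-copy-id -f -i ~/ceph.pub root@ceph-test-4
# ceph orch host add ceph-test-4 10.0.0.54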

----
Before removing host

root@ceph-test-2:/# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-test-2,ceph-test-3,ceph-test-4 (age 29m)
    mgr: ceph-test-2.vbjhdq(active, since 31m), standbys: ceph-test-4.jjubsa
    osd: 18 osds: 18 up (since 22m), 18 in (since 28m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   5.2 GiB used, 8.8 TiB / 8.8 TiB avail
    pgs:     129 active+clean

root@ceph-test-2:/# ceph orch host ls
HOST         ADDR       LABELS      STATUS
ceph-test-2  10.0.0.52  _admin,rgw
ceph-test-3  10.0.0.53
ceph-test-4  10.0.0.54

-------------
After removing host

# ceph orch host rm ceph-test-4 --offline --force
Removed offline host 'ceph-test-4'

# ceph orch ps
mon.ceph-test-4             ceph-test-4               stopped
osd.1                       ceph-test-4               error
osd.7                       ceph-test-4               error
....
....

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         5.85956  root default
-5         2.92978      host ceph-test-2
 2    hdd  0.48830          osd.2             up   1.00000  1.00000
 4    hdd  0.48830          osd.4             up   1.00000  1.00000
 8    hdd  0.48830          osd.8             up   1.00000  1.00000
11    hdd  0.48830          osd.11            up   1.00000  1.00000
14    hdd  0.48830          osd.14            up   1.00000  1.00000
16    hdd  0.48830          osd.16            up   1.00000  1.00000
-3         2.92978      host ceph-test-3
 0    hdd  0.48830          osd.0             up   1.00000  1.00000
 3    hdd  0.48830          osd.3             up   1.00000  1.00000
 6    hdd  0.48830          osd.6             up   1.00000  1.00000
 9    hdd  0.48830          osd.9             up   1.00000  1.00000
12    hdd  0.48830          osd.12            up   1.00000  1.00000
15    hdd  0.48830          osd.15            up   1.00000  1.00000

root@ceph-test-2:/rootfs/root# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            Degraded data redundancy: 145/669 objects degraded (21.674%), 24 pgs degraded, 71 pgs undersized

  services:
    mon: 2 daemons, quorum ceph-test-2,ceph-test-3 (age 50m)
    mgr: ceph-test-2.vbjhdq(active, since 2h), standbys: ceph-test-3.wzmioq
    osd: 12 osds: 12 up (since 51m), 12 in (since 2h); 58 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   3.5 GiB used, 5.9 TiB / 5.9 TiB avail
    pgs:     145/669 objects degraded (21.674%)
             75/669 objects misplaced (11.211%)
             54 active+clean+remapped
             47 active+undersized
             24 active+undersized+degraded
             4  active+clean

  progress:
    Global Recovery Event (50m)
      [================............] (remaining: 37m)

The cephadm logs show this as well:

2025-02-06T06:27:46.936+0000 7f85026a4700 -1 log_channel(cephadm) log [ERR] : auth get failed: failed to find osd.7 in keyring retval: -2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2136, in _daemon_action
    daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 494, in generate_config
    extra_ceph_config=daemon_spec.ceph_conf)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 520, in get_config_and_keyring
    'entity': entity,
  File "/usr/share/ceph/mgr/mgr_module.py", line 1593, in check_mon_command
    raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr} retval: {r.retval}')
mgr_module.MonCommandFailed: auth get failed: failed to find osd.7 in keyring retval: -2
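
I assume the traceback means the osd.* auth entities for that host are gone
from the mon keyring; something like this should confirm it (just my guess at
what cephadm is tripping over):

# ceph auth get osd.7
# ceph auth ls | grep osd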