Re: Best way to add back a host after removing offline - cephadm

Hi,

just add the host back to the cluster with 'ceph orch host add ...'. If it still has the cephadm public key, the orchestrator will deploy the missing mon daemon (depending on your actual mon spec) and a couple of other services.
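
A minimal sketch of the re-add, using the hostname and address from your 'ceph orch host ls' output; re-copying the cluster's SSH key is only needed if the host lost it:

  # only needed if ceph-test-4 no longer has the cluster's cephadm SSH key
  ceph cephadm get-pub-key > ~/ceph.pub
  ssh-copy-id -f -i ~/ceph.pub root@ceph-test-4

  # re-add the host; the orchestrator will then redeploy daemons per your specs
  ceph orch host add ceph-test-4 10.0.0.54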

In the current state, there won't be any recovery because you have only two hosts left but your crush rule requires three. There are ways to recover anyway, for example by editing the rule or reducing the pool size to 2, but I would only do that in a test cluster. Depending on what the host removal actually removed, you might be able to just reintegrate the OSDs (ceph cephadm osd activate <host>) and recovery will kick in. In case the OSD keyrings are gone, you can import them; they should still be present on the removed OSD host.
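
A rough sketch of those options (an untested outline, not a definitive procedure; the keyring path assumes cephadm's usual /var/lib/ceph/<fsid>/osd.N/ layout on the removed host, and <pool> is a placeholder):

  # test-cluster-only workaround: allow PGs to go clean with only two hosts
  ceph osd pool set <pool> size 2

  # once ceph-test-4 is back in the cluster, let cephadm adopt its existing OSDs
  ceph cephadm osd activate ceph-test-4

  # if an OSD's cephx entry is missing (as the 'failed to find osd.7 in keyring'
  # error suggests), re-add it from the keyring file on the OSD host, e.g.
  # /var/lib/ceph/fca870d8-e431-11ef-8000-bc2411363b7d/osd.7/keyring
  ceph auth add osd.7 \
      mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
      -i /var/lib/ceph/fca870d8-e431-11ef-8000-bc2411363b7d/osd.7/keyring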

Regards,
Eugen

Quoting Kirby Haze <kirbyhaze01@xxxxxxxxx>:

I have a 3-node test 17.2.7 cluster, and I decided to power down one of the
hosts, which contained a mon, 6 OSDs and a standby mgr. Then I used
`ceph orch host rm <host> --offline --force` to remove the host (after
powering it down).

All of this looks expected except for the logs after removing the host. These
pools have size 3 and min_size 2. Actually, I'm unsure what would happen if I
were even to successfully add the host back, as those objects would be
degraded. Besides the error from cephadm, what is Ceph thinking when I pulled
the plug and then removed that host? Am I guaranteed some type of recovery if
I do successfully add that host back?

----
Before removing host

root@ceph-test-2:/# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-test-2,ceph-test-3,ceph-test-4 (age 29m)
    mgr: ceph-test-2.vbjhdq(active, since 31m), standbys: ceph-test-4.jjubsa
    osd: 18 osds: 18 up (since 22m), 18 in (since 28m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   5.2 GiB used, 8.8 TiB / 8.8 TiB avail
    pgs:     129 active+clean

root@ceph-test-2:/# ceph orch host ls
HOST         ADDR       LABELS      STATUS
ceph-test-2  10.0.0.52  _admin,rgw
ceph-test-3  10.0.0.53
ceph-test-4  10.0.0.54

-------------
After removing host

# ceph orch host rm ceph-test-4 --offline --force
Removed offline host 'ceph-test-4'

# ceph orch ps
mon.ceph-test-4             ceph-test-4               stopped
osd.1                       ceph-test-4               error
osd.7                       ceph-test-4               error
....
....

# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         5.85956  root default
-5         2.92978      host ceph-test-2
 2    hdd  0.48830          osd.2             up   1.00000  1.00000
 4    hdd  0.48830          osd.4             up   1.00000  1.00000
 8    hdd  0.48830          osd.8             up   1.00000  1.00000
11    hdd  0.48830          osd.11            up   1.00000  1.00000
14    hdd  0.48830          osd.14            up   1.00000  1.00000
16    hdd  0.48830          osd.16            up   1.00000  1.00000
-3         2.92978      host ceph-test-3
 0    hdd  0.48830          osd.0             up   1.00000  1.00000
 3    hdd  0.48830          osd.3             up   1.00000  1.00000
 6    hdd  0.48830          osd.6             up   1.00000  1.00000
 9    hdd  0.48830          osd.9             up   1.00000  1.00000
12    hdd  0.48830          osd.12            up   1.00000  1.00000
15    hdd  0.48830          osd.15            up   1.00000  1.00000

root@ceph-test-2:/rootfs/root# ceph -s
  cluster:
    id:     fca870d8-e431-11ef-8000-bc2411363b7d
    health: HEALTH_WARN
            6 failed cephadm daemon(s)
            Degraded data redundancy: 145/669 objects degraded (21.674%),
24 pgs degraded, 71 pgs undersized

  services:
    mon: 2 daemons, quorum ceph-test-2,ceph-test-3 (age 50m)
    mgr: ceph-test-2.vbjhdq(active, since 2h), standbys: ceph-test-3.wzmioq
    osd: 12 osds: 12 up (since 51m), 12 in (since 2h); 58 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    pools:   5 pools, 129 pgs
    objects: 223 objects, 581 KiB
    usage:   3.5 GiB used, 5.9 TiB / 5.9 TiB avail
    pgs:     145/669 objects degraded (21.674%)
             75/669 objects misplaced (11.211%)
             54 active+clean+remapped
             47 active+undersized
             24 active+undersized+degraded
             4  active+clean

  progress:
    Global Recovery Event (50m)
      [================............] (remaining: 37m)

The cephadm logs show this as well

2025-02-06T06:27:46.936+0000 7f85026a4700 -1 log_channel(cephadm) log [ERR]
: auth get failed: failed to find osd.7 in keyring retval: -2
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1002, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2136, in _daemon_action
    daemon_spec)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 494,
in generate_config
    extra_ceph_config=daemon_spec.ceph_conf)
  File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 520,
in get_config_and_keyring
    'entity': entity,
  File "/usr/share/ceph/mgr/mgr_module.py", line 1593, in check_mon_command
    raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr}
retval: {r.retval}')
mgr_module.MonCommandFailed: auth get failed: failed to find osd.7 in
keyring retval: -2
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

