Re: ceph orch cannot refresh

Eugen Block <eblock@xxxxxx> · Tue, 17 Jan 2023 08:57:45 +0000

Hi,

have you tried a mgr failover? 'ceph mgr fail' should do the trick,
because restarting a mgr daemon won't fail it over. The logs of the
active mgr should give hints about what is failing, e.g. cephadm logs
--name mgr.<MGR>.
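
A minimal sketch of that sequence, assuming a standby mgr is available and an admin keyring is present on the host. The function name failover_and_refresh and the WAIT variable are made up for illustration; the ceph subcommands are the standard ones.

```shell
# Hypothetical helper: fail over the active mgr, then ask the
# orchestrator to refresh its daemon inventory.
failover_and_refresh() {
    ceph mgr stat               # shows which mgr is currently active
    ceph mgr fail               # hand over to a standby mgr
    sleep "${WAIT:-30}"         # assumed grace period for the new mgr
    ceph orch ps --refresh      # request a fresh daemon inventory
    ceph orch ls                # REFRESHED column should update
}
```

If the status still does not refresh, run cephadm logs --name mgr.<MGR> on the host where the active mgr is running, since it reads the logs of local daemons.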

Quoting Nicola Mori <mori@xxxxxxxxxx>:

Dear Ceph users,

after a host failure in my cluster (quincy 17.2.3 managed by
cephadm) ceph orch seems to be stuck and cannot operate. For example,
it has not refreshed the status of several services for about 20 hours:

# ceph orch ls
NAME                       PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager               ?:9093,9094      1/1  3m ago      3M   count:1
crash                                      9/10  20h ago     3M   *
grafana                    ?:3000           1/1  3m ago      3M   count:1
mds.wizard_fs                               0/3  <deleting>  13h  bofur;balin;aka;count:3
mds.wizardfs                                2/3  20h ago     70m  bofur;balin;aka;count:3
mgr                                         2/2  20h ago     15m  bofur;balin;count:2
mon                                         4/5  20h ago     93m  bofur;balin;aka;romolo;dwalin;count:5
node-exporter              ?:9100          9/10  20h ago     3M   *
osd                                          24  3m ago      -    <unmanaged>
osd.all-available-devices                    72  20h ago     4w   *
prometheus                 ?:9095           1/1  3m ago      3M   count:1
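
To pick out the stale entries from a listing like the one above, a small filter along these lines may help. It is only a sketch: the function name stale_services is made up, and it assumes the REFRESHED value is always printed as "<number><unit> ago", treating anything measured in hours or larger units as stale.

```shell
# Hypothetical filter: print services whose REFRESHED value is in
# hours (h), days (d), weeks (w), months (M) or years (y).
stale_services() {
    awk '{
        # PORTS may be empty, so locate REFRESHED via the word "ago"
        for (i = 2; i <= NF; i++)
            if ($i == "ago" && $(i-1) ~ /^[0-9]+[hdwMy]$/)
                print $1, $(i-1)
    }'
}
```

Usage: ceph orch ls | stale_services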

The failed machine (named bifur) is offline but still in the cluster  
since I'm planning to restore it:

# ceph orch host ls
HOST     ADDR           LABELS               STATUS
aka      172.16.253.7   _admin
balin    172.16.253.3
bifur    172.16.253.5   _admin               Offline
bofur    172.16.253.2   _admin
dwalin   172.16.253.10
ogion    172.16.253.6   _no_autotune_memory
prestno  172.16.253.9
remolo   172.16.253.1
rokanan  172.16.253.8
romolo   172.16.253.4
10 hosts in cluster

Since this machine hosted a mon I tried to redeploy it with:

# ceph orch apply mon --placement="5 bofur balin aka romolo dwalin"

but even though ceph orch ls shows that the mons should now be on
the machines specified by --placement (see above), the mon on bifur
still seems to be known to ceph orch, e.g.

# ceph orch restart mon
Scheduled to restart mon.aka on host 'aka'
Scheduled to restart mon.balin on host 'balin'
Scheduled to restart mon.bifur on host 'bifur'
Scheduled to restart mon.bofur on host 'bofur'
Scheduled to restart mon.romolo on host 'romolo'
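
The mismatch can be checked mechanically by comparing the hosts in that output with the placement list. This is a sketch under assumptions: the function name unexpected_daemons is made up, and it relies on the "Scheduled to restart <daemon> on host '<host>'" line format shown above.

```shell
# Hypothetical check: print daemons scheduled on hosts that are not
# in the space-separated placement list given as the first argument.
unexpected_daemons() {
    awk -v hosts="$1" '
        BEGIN {
            n = split(hosts, h, " ")
            for (i = 1; i <= n; i++) want[h[i]] = 1
        }
        /^Scheduled to restart/ {
            host = $NF
            gsub(/[^A-Za-z0-9._-]/, "", host)   # drop the quotes
            if (!(host in want)) print $4, host
        }'
}
```

Usage: ceph orch restart mon | unexpected_daemons "bofur balin aka romolo dwalin"

Note that this actually schedules the restarts; to avoid that, feed it a saved copy of the output instead. For the output above it reports mon.bifur on bifur. The same output also lists no mon on dwalin, which this sketch does not check.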

I manually restarted all the mon and mgr daemons on online hosts to  
no avail. At this point I am clueless, so any help is greatly  
appreciated.

Nicola
