Trying to resolve this, I first tried to pause the cephadm processes ('ceph config-key set
mgr/cephadm/pause true'), which led nowhere except to a loss of connectivity: how do you "resume"?
That does not appear anywhere in the documentation!
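For the record, my guess - not verified on this cluster - is that the pause is undone by setting the
key back to false, or by the orchestrator's resume command:
> ceph config-key set mgr/cephadm/pause false
> ceph orch resume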
Actually, there are quite a few things in Ceph that you can switch on but not off, or switch off but
not on - such as rebooting a mgr node ...
In addition to `ceph orch host ls` showing everything Offline, I thus ended up with:
> ceph -s
>   id:     98e1e122-ebe3-11ec-b165-80000208fe80
>   health: HEALTH_WARN
>           9 hosts fail cephadm check
>           21 stray daemon(s) not managed by cephadm
>
>   services:
>     mon: 5 daemons, quorum lxbk0374,lxbk0375,lxbk0376,lxbk0377,lxbk0378 (age 6d)
>     mgr: lxbk0375.qtgomh(active, since 6d), standbys: lxbk0376.jstndr, lxbk0374.hdvmvg
>     mds: 1/1 daemons up, 11 standby
>     osd: 24 osds: 24 up (since 5d), 24 in (since 5d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   3 pools, 641 pgs
>     objects: 4.77k objects, 16 GiB
>     usage:   50 GiB used, 909 TiB / 910 TiB avail
>     pgs:     641 active+clean
The good thing is that neither Ceph nor CephFS cares about the orchestrator thingy - everything keeps
working, it would seem ;-)
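Side note: `ceph health detail` should spell out which hosts fail the cephadm check and which daemons
count as stray, for anyone who wants to compare:
> ceph health detail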
Finally, the workaround (or solution?):
Re-adding 'missing' nodes is a bad idea in almost every system, but not in Ceph.
Go to lxbk0375, since that is the active mgr (cf. above):
> ssh-copy-id -f -i /etc/ceph/ceph.pub root@lxbk0374
> ceph orch host add lxbk0374 10.20.2.161
-> 'ceph orch host ls' now shows that node as no longer Offline.
-> Repeat with all the other hosts (a loop sketch follows below), and everything looks fine again from
the orch view.
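To do that in one go, a minimal sketch - assuming the same host/IP list as in the `ceph orch host ls`
output further down, and run on the active mgr where /etc/ceph/ceph.pub lives:

#!/bin/bash
# Re-add every host that 'ceph orch host ls' reports as Offline.
# Host/IP pairs taken from the listing below; adjust to your cluster.
while read -r host addr; do
    # ssh-copy-id runs ssh, which would otherwise eat the rest of the here-doc
    ssh-copy-id -f -i /etc/ceph/ceph.pub "root@${host}" < /dev/null
    ceph orch host add "${host}" "${addr}"
done <<'EOF'
lxbk0374 10.20.2.161
lxbk0375 10.20.2.162
lxbk0376 10.20.2.163
lxbk0377 10.20.2.164
lxbk0378 10.20.2.165
lxfs416 10.20.2.178
lxfs417 10.20.2.179
lxfs418 10.20.2.222
lxmds23 10.20.6.72
lxmds24 10.20.6.74
EOF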
My question: Did I miss this procedure in the manuals?
Cheers
Thomas
On 23/06/2022 18.29, Thomas Roth wrote:
Hi all,
found this bug https://tracker.ceph.com/issues/51629 (Octopus 15.2.13), reproduced it in Pacific and
now again in Quincy:
- new cluster
- 3 mgr nodes
- reboot active mgr node (see the note after this list)
- (only in Quincy:) standby mgr node takes over, rebooted node becomes standby
- `ceph orch host ls` shows all hosts as `offline`
- add a new host: not offline
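Note: just a guess on my part, but the same failover can presumably be triggered without a reboot by
failing the active mgr (name taken from the `ceph -s` output above), which might make this easier to
reproduce:
> ceph mgr fail lxbk0375.qtgomh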
In my setup, hostnames and IPs are well known, thus
# ceph orch host ls
HOST      ADDR         LABELS  STATUS
lxbk0374  10.20.2.161  _admin  Offline
lxbk0375  10.20.2.162          Offline
lxbk0376  10.20.2.163          Offline
lxbk0377  10.20.2.164          Offline
lxbk0378  10.20.2.165          Offline
lxfs416   10.20.2.178          Offline
lxfs417   10.20.2.179          Offline
lxfs418   10.20.2.222          Offline
lxmds22   10.20.6.67
lxmds23   10.20.6.72           Offline
lxmds24   10.20.6.74           Offline
(All lxbk are mon nodes, the first 3 are mgr, 'lxmds22' was added after the fatal reboot.)
Does this matter at all?
The old bug report is one year old, now with priority 'Low'. And surely some people must have rebooted
one or another of the hosts in their clusters...
There is a cephfs on our cluster, operations seem to be unaffected.
Cheers
Thomas
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz