Trying to resolve this, I first tried to pause the cephadm processes ('ceph config-key set
mgr/cephadm/pause true'), which led nowhere except to a loss of connectivity: how do you "resume"?
That does not appear anywhere in the documentation!
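For the record, my guess - not verified on this cluster - is that the pause is undone by setting the
key back to false, or by the orchestrator's resume command:
> ceph config-key set mgr/cephadm/pause false
> ceph orch resume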
Actually, there are quite a few things in Ceph that you can switch on but not off, or switch off but
not on - such as rebooting a mgr node ...
In addition to `ceph orch host ls` showing everything Offline, I thus ended up with:
> ceph -s
>   id:     98e1e122-ebe3-11ec-b165-80000208fe80
>   health: HEALTH_WARN
>           9 hosts fail cephadm check
>           21 stray daemon(s) not managed by cephadm
>
>   services:
>     mon: 5 daemons, quorum lxbk0374,lxbk0375,lxbk0376,lxbk0377,lxbk0378 (age 6d)
>     mgr: lxbk0375.qtgomh(active, since 6d), standbys: lxbk0376.jstndr, lxbk0374.hdvmvg
>     mds: 1/1 daemons up, 11 standby
>     osd: 24 osds: 24 up (since 5d), 24 in (since 5d)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   3 pools, 641 pgs
>     objects: 4.77k objects, 16 GiB
>     usage:   50 GiB used, 909 TiB / 910 TiB avail
>     pgs:     641 active+clean
The good thing is that neither Ceph nor CephFS cares about the orchestrator thingy - everything keeps
working, it would seem ;-)
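Side note: `ceph health detail` should spell out which hosts fail the cephadm check and which daemons
count as stray, for anyone who wants to compare:
> ceph health detail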
Finally, the workaround (or solution?):
Re-adding 'missing' nodes is a bad idea in almost every system, but not in Ceph.
Go to lxbk0375, since that is the active mgr (cf. above):
> ssh-copy-id -f -i /etc/ceph/ceph.pub root@lxbk0374
> ceph orch host add lxbk0374 10.20.2.161
-> 'ceph orch host ls' now shows that node as no longer Offline.
-> Repeat with all the other hosts (a loop sketch follows below), and everything looks fine again from
the orch view.
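To do that in one go, a minimal sketch - assuming the same host/IP list as in the `ceph orch host ls`
output further down, and run on the active mgr where /etc/ceph/ceph.pub lives:

#!/bin/bash
# Re-add every host that 'ceph orch host ls' reports as Offline.
# Host/IP pairs taken from the listing below; adjust to your cluster.
while read -r host addr; do
    # ssh-copy-id runs ssh, which would otherwise eat the rest of the here-doc
    ssh-copy-id -f -i /etc/ceph/ceph.pub "root@${host}" < /dev/null
    ceph orch host add "${host}" "${addr}"
done <<'EOF'
lxbk0374 10.20.2.161
lxbk0375 10.20.2.162
lxbk0376 10.20.2.163
lxbk0377 10.20.2.164
lxbk0378 10.20.2.165
lxfs416 10.20.2.178
lxfs417 10.20.2.179
lxfs418 10.20.2.222
lxmds23 10.20.6.72
lxmds24 10.20.6.74
EOF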
My question: Did I miss this procedure in the manuals?
Cheers
Thomas
On 23/06/2022 18.29, Thomas Roth wrote:
Hi all,
found this bug https://tracker.ceph.com/issues/51629 (Octopus 15.2.13), reproduced it in Pacific and
now again in Quincy:
- new cluster
- 3 mgr nodes
- reboot active mgr node (see the note after this list)
- (only in Quincy:) standby mgr node takes over, rebooted node becomes standby
- `ceph orch host ls` shows all hosts as `offline`
- add a new host: not offline
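Note: just a guess on my part, but the same failover can presumably be triggered without a reboot by
failing the active mgr (name taken from the `ceph -s` output above), which might make this easier to
reproduce:
> ceph mgr fail lxbk0375.qtgomh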
In my setup, hostnames and IPs are well known, thus
# ceph orch host ls
HOST      ADDR         LABELS  STATUS
lxbk0374  10.20.2.161  _admin  Offline
lxbk0375  10.20.2.162          Offline
lxbk0376  10.20.2.163          Offline
lxbk0377  10.20.2.164          Offline
lxbk0378  10.20.2.165          Offline
lxfs416   10.20.2.178          Offline
lxfs417   10.20.2.179          Offline
lxfs418   10.20.2.222          Offline
lxmds22   10.20.6.67
lxmds23   10.20.6.72           Offline
lxmds24   10.20.6.74           Offline
(All lxbk are mon nodes, the first 3 are mgr, 'lxmds22' was added after the fatal reboot.)
Does this matter at all?
The old bug report is one year old, now with priority 'Low'. And surely some people must have rebooted
one or another of the hosts in their clusters...
There is a cephfs on our cluster, operations seem to be unaffected.
Cheers
Thomas
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz