Dear gents,

To get familiar with the cephadm upgrade path and with cephadm in general (we heavily use old-style "ceph-deploy" Octopus-based production clusters), we decided to run some tests with a vanilla cluster on 15.2.11, based on CentOS 8, on top of vSphere. Deployment of the Octopus cluster works very well and we are excited about this new technique and all its possibilities. No errors, no clues... :-)

Unfortunately, the upgrade to Pacific (16.2.0 or 16.2.1) fails every time, with either the original Docker images or the quay.ceph.io/ceph-ci/ceph:pacific images. We use a small setup (3 mons, 2 mgrs, some OSDs).

This is the upgrade behaviour: the upgrade of both MGRs seems to be OK, but then we get this:

2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon mgr.c0n01.gstlmw *not deployed by correct version*

After this, the upgrade process gets stuck completely, although the cluster keeps running (minus one monitor daemon):

[root@c0n00 ~]# ceph -s
  cluster:
    id:     5541c866-a8fe-11eb-b604-005056b8f1bf
    health: HEALTH_WARN
            *3 hosts fail cephadm check*

  services:
    mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
    mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
    osd: 4 osds: 4 up (since 63m), 4 in (since 62m)
  [..]
  progress:
    Upgrade to 16.2.1-257-g717ce59b (0s)
      [=...........................]

{
    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
    "in_progress": true,
    "services_complete": [
        "mgr"
    ],
    "progress": "2/19 ceph daemons upgraded",
    "message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an unexpected exception"
}

[root@c0n00 ~]# ceph orch ps
NAME                 HOST   PORTS        STATUS           REFRESHED  AGE  VERSION               IMAGE ID      CONTAINER ID
alertmanager.c0n00   c0n00               running (56m)    4m ago     16h  0.20.0                0881eb8f169f  30d9eff06ce2
crash.c0n00          c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  91d3e4d0e14d
crash.c0n01          c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0ff4a20021df
crash.c0n02          c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  0253e6bb29a0
crash.c0n03          c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  291ce4f8b854
grafana.c0n00        c0n00               running (56m)    4m ago     16h  6.7.4                 80728b29ad3f  46d77b695da5
mgr.c0n00.bmtvpr     c0n00  *:8443,9283  running (56m)    4m ago     16h  16.2.1-257-g717ce59b  3be927f015dd  94a7008ccb4f
mgr.c0n01.jwfuca     c0n01               host is offline  16h ago    16h  16.2.1-257-g717ce59b  3be927f015dd  766ada65efa9
mon.c0n00            c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  b9f270cd99e2
mon.c0n02            c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  a90c21bfd49e
node-exporter.c0n00  c0n00               running (56m)    4m ago     16h  0.18.1                e5a616e4b9cf  eb1306811c6c
node-exporter.c0n01  c0n01               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  093a72542d3e
node-exporter.c0n02  c0n02               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  785531f5d6cf
node-exporter.c0n03  c0n03               host is offline  16h ago    16h  0.18.1                e5a616e4b9cf  074fac77e17c
osd.0                c0n02               host is offline  16h ago    16h  15.2.11               9d01da634b8f  c075bd047c0a
osd.1                c0n01               host is offline  16h ago    16h  15.2.11               9d01da634b8f  616aeda28504
osd.2                c0n03               host is offline  16h ago    16h  15.2.11               9d01da634b8f  b36453730c83
osd.3                c0n00               running (56m)    4m ago     16h  15.2.11               9d01da634b8f  e043abf53206
prometheus.c0n00     c0n00               running (56m)    4m ago     16h  2.18.1                de242295e225  7cb50c04e26a
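For completeness, this is roughly how we drive the upgrade and where the output above comes from. The exact invocations below are illustrative rather than a transcript of our session (we tried both the stock 16.2.0/16.2.1 images and the quay.ceph.io/ceph-ci build), and as far as we can tell the JSON above is what "ceph orch upgrade status" prints:

# Start the upgrade, either by version or by an explicit image (illustrative):
ceph orch upgrade start --ceph-version 16.2.1
ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:pacific

# Check progress; this is where the JSON status above comes from:
ceph orch upgrade status

# Raise cephadm logging to see the [DBG] lines quoted above:
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug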
After some digging into the daemon logs we found tracebacks (please see below). We also noticed that we can successfully reach each host via ssh -F ... !!! We've done tcpdumps while upgrading and every SYN gets its SYN/ACK... ;-) Because we get no errors while deploying a fresh Octopus cluster with cephadm (from https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and "cephadm prepare-host" is always OK), could it be a missing Python lib or something else that isn't checked by cephadm itself?

Thank you for any hint.

Christoph Ackermann

Traceback:

Traceback (most recent call last):
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
    s = io.read(1)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
    raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1166, in _remote_connection
    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1202, in _get_connection
    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__
    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 44, in _make_gateway
    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
    gw = gateway_bootstrap.bootstrap(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
    bootstrap_exec(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
    raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-61otabz_ -i /tmp/cephadm-identity-rt2nm0t4 root@c0n02

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 73, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 60, in create_from_spec_one
    replace_osd_ids=osd_id_claims.get(host, []), env_vars=env_vars
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 75, in create_single_host
    out, err, code = self._run_ceph_volume_command(host, cmd, env_vars=env_vars)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 295, in _run_ceph_volume_command
    error_ok=True)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1003, in _run_cephadm
    with self._remote_connection(host, addr) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1197, in _remote_connection
    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to c0n02 (c0n02).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@c0n02

To check that the host is reachable:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@c0n02
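PS: Following the "missing Python lib" theory, the execnet bootstrap dies with "expected 1 bytes, got 0", which would also happen if python3 on the remote host fails to start under the mgr's SSH configuration, not only if SSH itself fails. A rough check we plan to run, reusing the commands from the error message above (the host names are ours, and the python3 one-liner is just an illustrative probe, not something cephadm runs itself):

ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > cephadm_private_key
chmod 0600 cephadm_private_key
# Plain SSH login already works for us; the question is whether the remote
# Python interpreter that execnet bootstraps comes up cleanly on every host:
for h in c0n01 c0n02 c0n03; do
    ssh -F ssh_config -i cephadm_private_key root@$h "python3 -c 'import sys; print(sys.version)'"
done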