Hi everyone,

I'm facing a weird issue with one of my Pacific clusters. Brief intro:

- 5 nodes, Ubuntu 20.04, on 16.2.7 (ceph01…05)
- bootstrapped with cephadm about a year ago, using the then-current image from quay.io
- approx. 200 TB capacity, 5% used
- 5 OSDs per node (2 HDD / 2 SSD / 1 NVMe)
- each node has a MON, yeah, 5 MONs in charge
- 3 RGW
- 2 MGR
- 3 MDS (2 active, 1 standby)

The cluster is serving S3 objects and CephFS for k8s PVCs and is doing very well. But: during regular maintenance I found a heavily rotating store.db on EVERY node. Taking a closer look, I found weird stuff going on in the #####.log. The log grows at a rate of approx. 400 kB/s and rotates when it reaches a certain size.

store.db
-rw-r--r-- 1 ceph ceph 11445745 Jan 13 09:53 1546576.log
-rw-r--r-- 1 ceph ceph 67352998 Jan 13 09:53 1546578.sst
-rw-r--r-- 1 ceph ceph 67349926 Jan 13 09:53 1546579.sst
-rw-r--r-- 1 ceph ceph 67363989 Jan 13 09:53 1546580.sst
-rw-r--r-- 1 ceph ceph 41063487 Jan 13 09:53 1546581.sst

Inside it, among other things:

executing refresh((['ceph01', 'ceph02', 'ceph03', 'ceph04', 'ceph05'],)) failed.
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
    s = io.read(1)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
    raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1357, in _remote_connection
    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1340, in _get_connection
    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 35, in __init__
    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway
    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
    gw = gateway_bootstrap.bootstrap(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
    bootstrap_exec(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
    raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-6p_ae5op -i /tmp/cephadm-identity-hc1rt28x ubuntuadmin@<< IP_OF_CEPH-01 REPLACED >>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 76, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 312, in refresh
    with self._remote_connection(host) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1391, in _remote_connection
    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to ceph01 (<< IP_OF_CEPH-01 REPLACED >>). Please make sure that the host is reachable and accepts connections using the cephadm SSH key
...
... [some binary stuff here] ...
...
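For completeness, here is how I re-checked the SSH path that the orchestrator complains about (a quick sketch; the config-key name is what I believe Pacific's cephadm module uses, and ubuntuadmin is our cephadm SSH user, as in the traceback above):

  # Dump the SSH config and private key the cephadm mgr module actually uses
  ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
  ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_id
  chmod 600 /tmp/cephadm_id

  # Let cephadm run its own connectivity/host checks
  ceph cephadm check-host ceph01

  # Reproduce the connection manually with the same user the orchestrator uses
  ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_id ubuntuadmin@ceph01 true

The manual ssh works for me, so I assume the HostNotFound above is transient rather than a broken key. And for the growing store.db itself, I know I can compact a single mon on demand (assuming the mon id equals the hostname, as in our deployment), though that obviously only treats the symptom:

  ceph tell mon.ceph01 compact

But the store.db content itself is what worries me: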
(binary store.db record prefixes like "ceph01.sjtrnt..." stripped for readability)

Removing orphan daemon mds.cephfs.ceph02...
Removing daemon mds.cephfs.ceph02 from ceph01
Removing key for mds.cephfs.ceph02
Reconfiguring mds.cephfs.ceph02 (unknown last config time)...
Reconfiguring daemon mds.cephfs.ceph02 on ceph01
cephadm exited with an error code: 1, stderr:
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
Reconfig daemon mds.cephfs.ceph02 ...
ERROR: cannot reconfig, data path /var/lib/ceph/<<cluster-ID REPLACED>>/mds.cephfs.ceph02 does not exist
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:
[same "No such container" / "cannot reconfig, data path ... does not exist" output repeated]

Unable to add a Daemon without Service.
Please use `ceph orch apply ...` to create a Service.
Note, you might want to create the service with "unmanaged=true"
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
    return OrchResult(f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2440, in add_daemon
    ret.extend(self._add_daemon(d_type, spec))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2378, in _add_daemon
    raise OrchestratorError('Unable to add a Daemon without Service.\n'
orchestrator._interface.OrchestratorError: Unable to add a Daemon without Service.
Please use `ceph orch apply ...` to create a Service.

I'm confused that cephadm is trying to do "things" to a ceph02 daemon which obviously does not reside on node ceph01. Almost the same log lines appear on every MON host in its store.db. All in all it looks far from healthy, and I'm really concerned about it. Any help is highly appreciated! Thanks a lot.

Cheers,
Jürgen
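PS: Following the error's own suggestion, I assume the mds service spec could simply be re-applied along these lines (a sketch; the placement is just an example, our real spec may differ):

  # Check what the orchestrator currently knows about mds services and daemons
  ceph orch ls mds
  ceph orch ps --daemon-type mds

  # Re-create the service spec the error message asks for (example placement)
  ceph orch apply mds cephfs --placement="3 ceph01 ceph02 ceph03"

But before blindly re-applying anything, I'd rather understand why cephadm believes ceph02's mds lives on ceph01 in the first place.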