Heavy rotation in the store.db folder, along with traces and exceptions in the .log


 



Hi everyone,

I'm facing a weird issue with one of my Pacific clusters.

Brief intro:
- 5 nodes, Ubuntu 20.04, Ceph 16.2.7 (ceph01…05)
- bootstrapped with cephadm from a then-recent quay.io image (about a year ago)
- approx. 200 TB capacity, 5% used
- 5 OSDs per node (2 HDD / 2 SSD / 1 NVMe)
- a MON on each node, so 5 MONs in charge
- 3 RGWs
- 2 MGRs
- 3 MDS (2 active, 1 standby)
The cluster is serving S3 files and CephFS for k8s PVCs and is doing very well.

But:

During regular maintenance I found a heavily rotating store.db on EVERY node. Taking a closer look, I found weird stuff going on in the #####.log.
The log is growing at roughly 400k/s and rotates when it reaches a certain size.

store.db
-rw-r--r-- 1 ceph ceph 11445745 Jan 13 09:53 1546576.log
-rw-r--r-- 1 ceph ceph 67352998 Jan 13 09:53 1546578.sst
-rw-r--r-- 1 ceph ceph 67349926 Jan 13 09:53 1546579.sst
-rw-r--r-- 1 ceph ceph 67363989 Jan 13 09:53 1546580.sst
-rw-r--r-- 1 ceph ceph 41063487 Jan 13 09:53 1546581.sst
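For anyone who wants to quantify the churn: here is a small sketch for sampling the store.db size twice and computing the average growth rate. The path follows the standard cephadm layout and the 10 s interval is arbitrary; adjust the fsid/mon name for your cluster.

```shell
# Sketch: estimate how fast the mon's store.db is growing, in bytes/s.
# growth_rate: average growth from two byte counts and an interval.
growth_rate() {  # usage: growth_rate BYTES_BEFORE BYTES_AFTER SECONDS
    echo $(( ($2 - $1) / $3 ))
}

# On a mon host, sample the store.db twice, 10 s apart.
# Path follows the usual cephadm layout -- adjust for your cluster.
if command -v ceph >/dev/null 2>&1; then
    DB="/var/lib/ceph/$(ceph fsid)/mon.$(hostname -s)/store.db"
    b0=$(du -sb "$DB" | cut -f1)
    sleep 10
    b1=$(du -sb "$DB" | cut -f1)
    echo "approx $(growth_rate "$b0" "$b1" 10) bytes/s"
fi
```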



executing refresh((['ceph01', 'ceph02', 'ceph03', 'ceph04', 'ceph05'],)) failed.
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 48, in bootstrap_exec
    s = io.read(1)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
    raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1357, in _remote_connection
    conn, connr = self.mgr._get_connection(addr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1340, in _get_connection
    sudo=True if self.ssh_user != 'root' else False)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 35, in __init__
    self.gateway = self._make_gateway(hostname)
  File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 46, in _make_gateway
    self._make_connection_string(hostname)
  File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
    gw = gateway_bootstrap.bootstrap(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
    bootstrap_exec(io, spec)
  File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 53, in bootstrap_exec
    raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-6p_ae5op -i /tmp/cephadm-identity-hc1rt28x ubuntuadmin@<< IP_OF_CEPH-01 REPLACED >>

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 76, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 312, in refresh
    with self._remote_connection(host) as tpl:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1391, in _remote_connection
    raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to ceph01 (<< IP_OF_CEPH-01 REPLACED >>).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key
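When chasing this kind of HostNotFound, one way to reproduce the orchestrator's SSH attempt by hand is with the same config and key the cephadm mgr module uses (a sketch; `ubuntuadmin` is the ssh_user from the traceback above, substitute yours):

```shell
# Extract the SSH config and identity key used by the cephadm mgr module,
# then try the connection manually (run on a node with an admin keyring).
ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_id_key
chmod 0600 /tmp/cephadm_id_key

# Same -F/-i pattern as in the traceback.
ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_id_key ubuntuadmin@ceph01 true \
    && echo "SSH OK" || echo "SSH still failing"
```

If the manual ssh fails the same way, the problem is connectivity/keys rather than the orchestrator itself.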
...
... [some binary stuff here] …
...
ceph01.sjtrnt [binary] Removing orphan daemon mds.cephfs.ceph02… [cephadm]
ceph01.sjtrnt [binary] Removing daemon mds.cephfs.ceph02 from ceph01 [cephadm]
ceph01.sjtrnt [binary] Removing key for mds.cephfs.ceph02 [cephadm]
ceph01.sjtrnt [binary] Reconfiguring mds.cephfs.ceph02 (unknown last config time)... [cephadm]
ceph01.sjtrnt [binary] Reconfiguring daemon mds.cephfs.ceph02 on ceph01 [cephadm]
ceph01.sjtrnt [binary] cephadm exited with an error code: 1, stderr:Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
/usr/bin/docker: stdout 
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
/usr/bin/docker: stdout 
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
Reconfig daemon mds.cephfs.ceph02 ...
ERROR: cannot reconfig, data path /var/lib/ceph/<<cluster-ID REPLACED>>/mds.cephfs.ceph02 does not exist
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
/usr/bin/docker: stdout 
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds-cephfs-ceph02
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
/usr/bin/docker: stdout 
/usr/bin/docker: stderr Error: No such container: ceph-<<cluster-ID REPLACED>>-mds.cephfs.ceph02
Reconfig daemon mds.cephfs.ceph02 ...
ERROR: cannot reconfig, data path /var/lib/ceph/<<cluster-ID REPLACED>>/mds.cephfs.ceph02 does not exist [cephadm]
Unable to add a Daemon without Service.
Please use `ceph orch apply ...` to create a Service.
Note, you might want to create the service with "unmanaged=true"
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
    return OrchResult(f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2440, in add_daemon
    ret.extend(self._add_daemon(d_type, spec))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2378, in _add_daemon
    raise OrchestratorError('Unable to add a Daemon without Service.\n'
orchestrator._interface.OrchestratorError: Unable to add a Daemon without Service.
Please use `ceph orch apply ...` to create a Service.
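In case it helps others reading along, a way to compare what the orchestrator thinks is running against what cephadm actually finds on the hosts, and to recreate the mds service spec if it has gone missing, could look like this (a sketch; the fs name `cephfs` and the placement count are taken from the log excerpts and my setup, not verified advice):

```shell
# Orchestrator's view of the mds daemons and the service spec:
ceph orch ps --daemon-type mds
ceph orch ls mds --export

# On ceph01/ceph02: what cephadm actually finds on disk:
cephadm ls | grep mds

# If the mds service spec is missing, recreate it. --unmanaged keeps the
# orchestrator from touching the daemons while you investigate (as the
# error message itself suggests).
ceph orch apply mds cephfs --placement='3' --unmanaged
```

Once things look sane again, `ceph orch apply mds cephfs --placement='3'` (without `--unmanaged`) would hand the daemons back to the orchestrator.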


I'm confused about cephadm attempting to do "things" to a ceph02 daemon that obviously does not reside on node ceph01. Almost the same log lines appear on each MON host in its store.db.
All in all it looks far from healthy, and I'm really concerned about that.
Any help is highly appreciated! Thanks a lot.


Cheers, 
Jürgen
 




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
