Dear all,

We're having an odd problem with a recently installed Quincy/cephadm cluster on CentOS 8 Stream with Podman, where the orchestrator appears to get wedged and just won't implement any changes. The cluster was installed and working for a few weeks, then we added an NFS export which worked for a while; then we had some problems with it, tried to restart/redeploy it, and found that the orchestrator wouldn't deploy new NFS server containers. We then tried to restart the MGR process(es) by stopping one and having the orchestrator redeploy it, but it didn't.

The overall effect is that the orchestrator won't try to start containers - it knows what it's supposed to be doing (you can tell it to do new things, e.g. deploy a new NFS cluster, and that's reflected correctly in both the CLI and the web control panel), but it just doesn't actually deploy anything.

This looks a bit like this Reddit post:
https://www.reddit.com/r/ceph/comments/v3kdix/cephadm_not_deploying_new_mgr_daemons_to_match/
and this mailing list post:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YREK7HUBNIMTKR5GU5L5E5CNFI7FDKLF/

We've found that in our case this appears to be due to one node in the cluster being in a strange state. It's not always the same node, and it doesn't have to be the node running the MGR or one of the nodes targeted to run the new containers - *any* node in the system being in this state will wedge the orchestrator.

A 'stuck' node can't start a new local container with './cephadm shell', and it sometimes (but not always) appears in the Cluster->Hosts section of the web interface with a blank machine model name and 'NaN' for its capacity (I'm guessing these values are cached and then time out after a while?). Containers already running on the node (e.g. OSDs) appear to carry on working. As well as failing to start containers, while in this state the orchestrator will also fail to copy /etc/ceph/* to a new node with the '_admin' label.

Rebooting the 'stuck' node instantly unwedges the orchestrator as soon as the node goes down - it doesn't have to come back up working, it just has to stop. As soon as the 'stuck' node is down, the orchestrator catches up on outstanding requests, starts new containers and brings everything into line with the requested state.

My current best guess is that the first thing the orchestrator does when it needs to change something is enumerate the nodes to see where it can start things; it hangs trying to query the 'stuck' node and can't recover on its own.

I've included more details and command outputs below, but:
- Does that sound feasible?
- Does this sound familiar to anyone?
- Does anyone know how to fix it?
- Or how to narrow down the root cause to turn this into a proper bug report?
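(For anyone who wants to compare notes, the checks we can still run from a working admin node while it's wedged are roughly the ones below. Treat this as a sketch based on my reading of the Quincy cephadm docs rather than a tested procedure - the host name is just an example from our cluster, and I'm not certain how useful 'check-host' is when the remote node is in this state:)

# ceph orch host ls                    # hosts as the orchestrator sees them
# ceph log last cephadm                # recent cephadm events from the cluster log
# ceph cephadm check-host ceph-r2n5    # re-check the mgr's connection to a suspect host
# ceph mgr fail                        # fail the active mgr over to a standby

If there's a better way to see which host the cephadm module is currently blocked on, pointers would be very welcome.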
Ewan

While in the wedged state the orchestrator thinks it's fine:

# ceph orch status --detail
Backend: cephadm
Available: Yes
Paused: No
Host Parallelism: 10

The cluster's overall health is fine:

# ceph -s
  cluster:
    id:     58140ed2-4ed4-11ed-b4db-5c6f69756a60
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-r3n4,ceph-r1n4,ceph-r2n4,ceph-r1n5,ceph-r2n5 (age 2w)
    mgr: ceph-r1n4.mgqrwx(active, since 2d)
    mds: 1/1 daemons up, 3 standby
    osd: 294 osds: 294 up (since 2w), 294 in (since 5w)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 9281 pgs
    objects: 252.26M objects, 649 TiB
    usage:   2.1 PiB used, 2.2 PiB / 4.3 PiB avail
    pgs:     9269 active+clean
             12   active+clean+scrubbing+deep

This was tested by deploying a new NFS service in the existing cluster, then a whole new NFS cluster (nfs.cephnfstwo), and removing the original NFS cluster (nfs.cephnfsone). The changes made from the CLI are reflected in the web dashboard but never actioned; e.g. only one MGR is running out of a target of two, nothing is running for NFS cluster 'cephnfstwo', and the original 'nfs.cephnfsone' cluster is shown as 'deleting' but never actually goes away:

# ceph orch ls
NAME            PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager    ?:9093,9094      1/1  2d ago      5w   count:1
crash                          21/21  4w ago      5w   *
grafana         ?:3000           1/1  2d ago      5w   count:1
mds.mds1                         4/4  4w ago      5w   count:4
mgr                              1/2  2w ago      2w   count:2
mon                              5/5  2w ago      5w   count:5
nfs.cephnfsone                   0/1  <deleting>  2w   ceph-r1n5;ceph-r2n5;count:1
nfs.cephnfstwo  ?:2049           0/1  -           2d   count:1
node-exporter   ?:9100         21/21  4w ago      5w   *
osd                              294  4w ago      -    <unmanaged>
prometheus      ?:9095           1/1  2d ago      5w   count:1

The 'stuck' node continues to run its services, including OSDs, and in cases where it also hosts the active MGR the web interface remains accessible. But SSHing in and trying to start a cephadm shell hangs with no output, and when interrupted with Ctrl-C gives Python errors:

# ./cephadm shell
^CTraceback (most recent call last):
  File "./cephadm", line 9491, in <module>
    main()
  File "./cephadm", line 9479, in main
    r = ctx.func(ctx)
  File "./cephadm", line 2083, in _infer_config
    return func(ctx)
Loop <_UnixSelectorEventLoop running=False closed=True debug=False> that handles pid 110551 is closed
  File "./cephadm", line 2007, in _infer_fsid
    daemon_list = list_daemons(ctx, detail=False)
  File "./cephadm", line 6298, in list_daemons
    verbosity=CallVerbosity.QUIET
  File "./cephadm", line 1764, in call
    stdout, stderr, returncode = async_run(run_with_timeout())
  File "./cephadm", line 1709, in async_run
    return loop.run_until_complete(coro)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever
    self._run_once()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1415, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib64/python3.6/selectors.py", line 445, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <bound method BaseSubprocessTransport.__del__ of <_UnixSubprocessTransport closed pid=110551 running stdout=<_UnixReadPipeTransport closing fd=7 open> stderr=<_UnixReadPipeTransport fd=9 open>>>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/base_subprocess.py", line 132, in __del__
  File "/usr/lib64/python3.6/asyncio/base_subprocess.py", line 106, in close
  File "/usr/lib64/python3.6/asyncio/unix_events.py", line 423, in close
  File "/usr/lib64/python3.6/asyncio/unix_events.py", line 452, in _close
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 591, in call_soon
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 377, in _check_closed
RuntimeError: Event loop is closed