Orchestrator hanging on 'stuck' nodes

Dear all,

We're having an odd problem with a recently installed Quincy/cephadm cluster on CentOS 8 Stream with Podman, where the orchestrator appears to get wedged and just won't implement any changes. 

The cluster was installed and worked fine for a few weeks. We then added an NFS export, which worked for a while; after running into problems with it we tried to restart/redeploy it and found that the orchestrator wouldn't deploy new NFS server containers. We then tried to restart the MGR process(es) by stopping one and letting the orchestrator redeploy it, but it never did. The overall effect is that the orchestrator won't try to start containers - it knows what it's supposed to be doing (you can tell it to do new things, e.g. deploy a new NFS cluster, and that's reflected correctly in both the CLI and the web dashboard), but it never actually deploys anything.
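For context, the operations that get accepted but never actioned are along these lines (a rough reconstruction rather than the exact commands we ran; the MGR daemon name is a placeholder):

# ceph orch redeploy nfs.cephnfsone
# ceph orch daemon rm mgr.<host>.<id>

In each case the requested state is updated, but no new containers are ever started.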

This looks a bit like this Reddit post: https://www.reddit.com/r/ceph/comments/v3kdix/cephadm_not_deploying_new_mgr_daemons_to_match/
And this mailing list post: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YREK7HUBNIMTKR5GU5L5E5CNFI7FDKLF/

We've found that in our case this appears to be due to one node in the cluster being in a strange state. It's not always the same node, and it doesn't have to be the node running the MGR or the node(s) targeted to run the new containers - *any* node in the system being in this state will wedge the orchestrator. A 'stuck' node can't start a new local container with './cephadm shell', and it sometimes (but not always) appears in the Cluster->Hosts section of the web interface with a blank machine model name and 'NaN' for its capacity (I'm guessing these values are cached and then time out after a while?). Containers already running on the node (e.g. OSDs) appear to carry on working.
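We haven't found a reliable way to spot a stuck node from the MGR side; the obvious checks would be something like the following (ceph-r1n5 is just an example hostname), though we haven't yet captured their output while a node was actually stuck:

# ceph orch host ls
# ceph cephadm check-host ceph-r1n5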

As well as failing to start containers, while in this state the orchestrator will also fail to copy /etc/ceph/* to a new node with the '_admin' tag. Rebooting the 'stuck' node unwedges the orchestrator the instant the node goes down - it doesn't have to come back up working, it just has to stop. Once the 'stuck' node is down, the orchestrator catches up on outstanding requests, starts new containers and brings everything into line with the requested state.
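(By the '_admin' tag I mean the standard label that makes cephadm copy the client config and keyring to a host, added with something like the following - the hostname here is a placeholder:

# ceph orch host label add <new-host> _admin

It's that copy step which never happens while the cluster is wedged.)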

My current best guess is that the first thing the orchestrator does when it needs to change something is enumerate the nodes to see where it can start things; it hangs trying to query the 'stuck' node and can't recover on its own.
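Next time it happens we plan to turn up the cephadm logging on the MGR to see where it actually stops - if I've remembered the troubleshooting docs correctly, that's roughly:

# ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# ceph -W cephadm --watch-debug

or, after the event:

# ceph log last cephadm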

I've included more details and command outputs below, but:

- Does that sound feasible?
- Does this sound familiar to anyone?
- Does anyone know how to fix it?
- Or how to narrow down the root cause to turn this into a proper bug report?

Ewan



While in the wedged state the orchestrator thinks it's fine:

# ceph orch status --detail
Backend: cephadm
Available: Yes
Paused: No
Host Parallelism: 10

The cluster overall health is fine:

# ceph -s
  cluster:
    id:     58140ed2-4ed4-11ed-b4db-5c6f69756a60
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-r3n4,ceph-r1n4,ceph-r2n4,ceph-r1n5,ceph-r2n5 (age 2w)
    mgr: ceph-r1n4.mgqrwx(active, since 2d)
    mds: 1/1 daemons up, 3 standby
    osd: 294 osds: 294 up (since 2w), 294 in (since 5w)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 9281 pgs
    objects: 252.26M objects, 649 TiB
    usage:   2.1 PiB used, 2.2 PiB / 4.3 PiB avail
    pgs:     9269 active+clean
             12   active+clean+scrubbing+deep

This was tested by deploying a new NFS service in the existing cluster and then a whole new NFS cluster (nfs.cephnfstwo), and removing the original NFS cluster (nfs.cephnfsone). The changes made from the CLI are reflected in the web dashboard but not actioned; e.g. only one MGR running of a target of two, nothing running for NFS cluster 'cephnfstwo', and the original 'nfs.cephnfsone' cluster shown as 'deleting' but not actually gone (the rough commands used for this test are listed after the output):

# ceph orch ls
NAME            PORTS        RUNNING  REFRESHED   AGE  PLACEMENT
alertmanager    ?:9093,9094      1/1  2d ago      5w   count:1
crash                          21/21  4w ago      5w   *
grafana         ?:3000           1/1  2d ago      5w   count:1
mds.mds1                         4/4  4w ago      5w   count:4
mgr                              1/2  2w ago      2w   count:2
mon                              5/5  2w ago      5w   count:5
nfs.cephnfsone                   0/1  <deleting>  2w   ceph-r1n5;ceph-r2n5;count:1
nfs.cephnfstwo  ?:2049           0/1  -           2d   count:1
node-exporter   ?:9100         21/21  4w ago      5w   *
osd                              294  4w ago      -    <unmanaged>
prometheus      ?:9095           1/1  2d ago      5w   count:1

The 'stuck' node continues to run its existing services, including OSDs, and in cases where it also hosts the active MGR the web interface remains accessible. However, SSHing in and trying to start a cephadm shell hangs with no output, and when interrupted with Ctrl-C it gives Python errors:

# ./cephadm shell
^CTraceback (most recent call last):
  File "./cephadm", line 9491, in <module>
    main()
  File "./cephadm", line 9479, in main
    r = ctx.func(ctx)
  File "./cephadm", line 2083, in _infer_config
    return func(ctx)
Loop <_UnixSelectorEventLoop running=False closed=True debug=False> that handles pid 110551 is closed
  File "./cephadm", line 2007, in _infer_fsid
    daemon_list = list_daemons(ctx, detail=False)
  File "./cephadm", line 6298, in list_daemons
    verbosity=CallVerbosity.QUIET
  File "./cephadm", line 1764, in call
    stdout, stderr, returncode = async_run(run_with_timeout())
  File "./cephadm", line 1709, in async_run
    return loop.run_until_complete(coro)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 471, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 438, in run_forever
    self._run_once()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1415, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib64/python3.6/selectors.py", line 445, in select
    fd_event_list = self._epoll.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <bound method BaseSubprocessTransport.__del__ of <_UnixSubprocessTransport closed pid=110551 running stdout=<_UnixReadPipeTransport closing fd=7 open> stderr=<_UnixReadPipeTransport fd=9 open>>>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/base_subprocess.py", line 132, in __del__
  File "/usr/lib64/python3.6/asyncio/base_subprocess.py", line 106, in close
  File "/usr/lib64/python3.6/asyncio/unix_events.py", line 423, in close
  File "/usr/lib64/python3.6/asyncio/unix_events.py", line 452, in _close
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 591, in call_soon
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 377, in _check_closed
RuntimeError: Event loop is closed
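If it would help turn this into a proper bug report, next time a node gets stuck we can try to capture more local state before rebooting it. The sort of thing we have in mind (suggestions welcome) is running the following on the stuck node, to see whether it's Podman itself that's hanging or something higher up:

# podman ps
# ./cephadm ls
# systemctl list-units 'ceph*'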