I just wanted to follow up on this issue, as it corrected itself today. I started a drain/remove on two hosts a few weeks back; after the rolling restart of the mgr/mon daemons on the cluster, it seems the ops queue either became locked or was overwhelmed with requests. I had a degraded PG during the rolling reboot of the mon/mgr nodes, and that appears to have blocked the ceph orch, balancer, and autoscale-status CLI commands from returning. I could see in the manager debug logs that the balancer was in fact running and returning results from its internal scheduled process, but the CLI would hang indefinitely. This morning the last degraded/offline PG was resolved and all commands are working again.

Moving forward, is there a method to view the ops queue, or to monitor whether the queue fills up and starts to deprioritize CLI commands?

Tim

On 7/22/22, 6:32 PM, "Tim Olow" <tim@xxxxxxxx> wrote:

Howdy,

I seem to be facing a problem on my 16.2.9 Ceph cluster. After a staggered reboot of my 3 infra nodes, all ceph orch commands hang, much like in this previously reported issue [1].

I have paused orch and rebuilt a manager by hand as outlined here [2], but the issue persists. I am unable to scale services up or down, restart daemons, etc.

ceph orch ls --verbose
<snip>
[{'flags': 8, 'help': 'List services known to orchestrator', 'module': 'mgr', 'perm': 'r', 'sig': [
  argdesc(<class 'ceph_argparse.CephPrefix'>, req=True, name=prefix, n=1, numseen=0, prefix=orch),
  argdesc(<class 'ceph_argparse.CephPrefix'>, req=True, name=prefix, n=1, numseen=0, prefix=ls),
  argdesc(<class 'ceph_argparse.CephString'>, req=False, name=service_type, n=1, numseen=0),
  argdesc(<class 'ceph_argparse.CephString'>, req=False, name=service_name, n=1, numseen=0),
  argdesc(<class 'ceph_argparse.CephBool'>, req=False, name=export, n=1, numseen=0),
  argdesc(<class 'ceph_argparse.CephChoices'>, req=False, name=format, n=1, numseen=0, strings=plain|json|json-pretty|yaml|xml-pretty|xml),
  argdesc(<class 'ceph_argparse.CephBool'>, req=False, name=refresh, n=1, numseen=0)]}]
Submitting command: {'prefix': 'orch ls', 'target': ('mon-mgr', '')}
submit {"prefix": "orch ls", "target": ["mon-mgr", ""]} to mon-mgr
<hang>

Debug output on the manager:

debug 2022-07-22T23:27:12.509+0000 7fc180230700 0 log_channel(audit) log [DBG] : from='client.1084220 -' entity='client.admin' cmd=[{"prefix": "orch ls", "target": ["mon-mgr", ""]}]: dispatch

I have collected a startup log of the manager and uploaded it for review [3].

Many Thanks,

Tim

[1] https://www.spinics.net/lists/ceph-users/msg68398.html
[2] https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
[3] https://pastebin.com/Dvb8sEbz
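On the ops-queue question above: on recent releases the monitor tracks its in-flight operations and exposes them through the local admin socket, which is one way to see whether requests are piling up while the CLI hangs. A minimal sketch, assuming a monitor named mon.a (a placeholder; run these on the host carrying that daemon):

    # Operations currently in flight on the monitor:
    ceph daemon mon.a ops

    # Recently completed operations the mon has recorded:
    ceph daemon mon.a dump_historic_ops

    # Cluster-wide, stuck requests also surface as a SLOW_OPS health warning:
    ceph health detail

As far as I know the mgr's own command queue is not exposed the same way, so a hang inside a mgr module (like the orchestrator here) may not show up in these dumps even while the mon side looks clean.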
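For completeness, the "paused orch and rebuilt a manager" step in the quoted message corresponds roughly to the short fail-over path below; this is a sketch, not the full manual mgr redeploy described in [2], and whether it helps depends on the root cause:

    # Stop cephadm from applying any changes while debugging:
    ceph orch pause

    # Fail over from the active mgr to a standby so a new active is elected
    # (on older releases, name the active mgr: ceph mgr fail <name>):
    ceph mgr fail

    # Once things look healthy again, let the orchestrator resume:
    ceph orch resume

In this case neither step cleared the hang; as the follow-up notes, the commands only started returning once the last degraded PG recovered.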