No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

Paul Choi <pchoi@xxxxxxx> · Fri, 20 Mar 2020 06:33:39 -1000

Hello,

We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8
the Prometheus plugin seems to hang a lot. It used to respond in under 10s,
but now it often hangs. Restarting the mgr processes helps temporarily, but
within minutes it gets stuck again.

The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target` and
needs to be kill -9'ed.

Is there anything I can do to address this issue, or at least get better
visibility into the issue?
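
To put numbers on "hangs", something like the following could time each
scrape. This is only a rough sketch: it assumes the prometheus module is
listening on its default port 9283 on the active mgr (woodenbox2 here), and
the 30s timeout, 10s interval and the time_scrape helper name are arbitrary
choices for illustration.

```python
import time
import urllib.request


def time_scrape(url, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once; return (seconds_elapsed, bytes_received or None on failure)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            size = len(resp.read())
    except OSError:
        # URLError, timeouts and refused connections are all OSError subclasses.
        size = None
    return time.monotonic() - start, size


if __name__ == "__main__":
    # Assumption: default prometheus module port (9283) on the active mgr host.
    url = "http://woodenbox2:9283/metrics"
    for _ in range(5):
        elapsed, size = time_scrape(url)
        result = "no reply" if size is None else "%d bytes" % size
        print("%6.2fs  %s" % (elapsed, result))
        time.sleep(10)
```

Running it right after a mgr restart and again a few minutes later should
show roughly when the endpoint stops answering.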

We only have a few plugins enabled:
$ ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "prometheus",
        "zabbix"
    ],

3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's
a busy one with lots of rebalancing. (I don't know if a busy cluster would
seriously affect the mgr's performance, but just throwing it out there)

  services:
    mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
    mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
    mds: cephfs-1/1/1 up  {0=woodenbox6=up:active}, 1 up:standby-replay
    osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
    rgw: 4 daemons active

Thanks in advance for your help,

-Paul Choi